Conceptual factoring and unification of graphs representing semantic models

ABSTRACT

Techniques for factoring one or more source graphs into a composite graph containing nodes representing analogous elements of the source graphs and a variability graph containing nodes representing differences in the source graphs. The composite graph is made by taking analogous input trees from the source graphs and traversing the trees from top to bottom looking for nodes in each tree at each level that are analogous to the nodes at that level in the other input trees. The sets of analogous nodes are found by first automatically correlating the nodes in the level currently being examined. Correlation may, for example, be based on similar values of a property of the nodes being correlated. Representations of the sets of correlated nodes are then displayed to a user, who indicates which sets of correlated nodes are in fact analogous. The user may also indicate that the nodes in a set of correlated nodes are not analogous or that nodes that were found by the automatic correlation not to be analogous are in fact analogous. The analogous nodes are allocated to a corresponding node at a corresponding level in the composite graph; the other nodes are allocated to a set of anomalous nodes. One application for the techniques is managing graphs which are models of catalogs of items.

CROSS REFERENCES TO RELATED APPLICATIONS

[0001] The present patent application claims priority from U.S. provisional patent application No. 60/185,096, Dean T. Allemang and Mark A. Simos, Conceptual factoring and unification: an automated, human-in-the-loop procedure for factoring source metadata with repetitive substructure and analogous content into multiple, nonredundant interacting semantic models, filed Feb. 25, 2000.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The invention relates generally to the manipulation of representations of graphs in computer systems and more specifically to automated techniques for conceptually factoring and/or unifying graphs.

[0004] 2. Description of Related Art

[0005] Information is useful only if it is accessible. There are two senses in which it must be accessible: those who need it must have physical access to it, and it must be indexed or cataloged so that those who need a particular item of information can easily find what they want. The data processing and communications revolutions of the second half of the twentieth century made it possible both to store much more information and to provide much more physical access to the stored information than ever before. The database technology component of the data processing revolution also made data cataloging and indexing easier than ever before, but the users of the information needed far more flexibility in finding, viewing, and analyzing the information than the relatively rigid database systems could provide.

[0006] The development of electronic commerce, or E-commerce, made flexible access to information more important than ever before. If E-commerce is to succeed, a Web merchant has to offer the E-shopper easier access to the goods or services being sold than what the shopper can get by ordering from a mail-order catalog or by going to the local shopping mall. To the shopper, access is only easy if it is access the way the shopper wants to have it, and in the E-commerce context, that means that the Web merchant must offer the shopper as many different ways to access the goods or services as there are kinds of shoppers.

[0007] A particularly effective way of providing flexible access to information is that described in the PCT International Application PCT/US00/01042, J. Anthony, A system for composing applications based on explicit semantic models, event driven autonomous agents, and resource proxies, filed Jan. 14, 2000 and published Jul. 20, 2000 as International Publication Number WO 00/42529. FIGS. 1-12 of PCT/US00/01042 are included in the present patent application along with those parts of the Detailed Description that describe them. The system that is the subject matter of PCT/US00/01042 will be termed in the following the Ariadne system. In the Ariadne system, representations of graphs are used to organize information. Vertices in the graphs represent items of information and concepts that organize the items of information and edges in the graphs represent relationships between the vertices. In E-commerce, the items of information are typically product descriptions, while the concepts organize the product descriptions so that the Web shopper can access them in various ways. For example, a description of a given kind of shoe may be accessible not only via the concept “shoes”, but via concepts such as “leather”, “men's wear”, “formal wear”, “color”, and so forth. The concepts themselves are organized into models. Each model belongs to a particular model type. The model type for the model specifies the properties of the edges that connect the vertices representing the concepts. An overview of the Ariadne system's graphs, models, and model types may be found in the sections Using graphs to specify multiple aspects of a collection of data through Relating concepts to the world in the Detailed Description of the present patent application.

[0008] While Ariadne models make providing flexible access to information easier than ever before, the models must be made and maintained. When Ariadne is used for E-commerce, for example, the models that describe the products must be made. There is information and to spare in catalogs and databases about the products to be accessed using Ariadne models, but the models must still be made from the information. A solution to that problem is described in the PCT international application PCT/US01/02688, J. S. Anthony and Dean T. Allemang, Software composition using graph types, graphs, and agents, filed Jan. 26, 2001. As described in PCT/US01/02688, the Ariadne system uses graphs and agents, programs that are executed in response to events in the context provided by one or more models, to automatically convert catalogs represented in XML into Ariadne models. The same techniques can be used to convert other legacy representations of information into Ariadne models.

[0009] The maintenance problem, however, remains. It has two aspects: eliminating redundant information in a single model and integrating information from different sources.

[0010] Eliminating Redundant Information

[0011] Because catalogs are linear, they contain much redundant information; this information remains in the Ariadne model made from the catalog. The redundant information creates many problems:

[0012] Catalog size may increase, in some cases at a nonlinear rate relative to the number of truly new categories that are being added to the catalog.

[0013] It is difficult to maintain catalogs consistently: updates may need to be made at numerous points within a structure.

[0014] Any given catalog structure will favor only certain styles of navigating and querying the catalog, with inadequate user support for other styles and other scenarios. Awkward “climb-around” navigation may be required to move to a conceptually closely related topic that is distant within the actual hierarchy. Mitigating this problem with ancillary links specified by human catalogers does not scale or persist well; such links are effort- and knowledge-intensive to create, maintain, and change over time.

[0015] Integrating Information from Different Sources

[0016] Reconciling information from different sources requires that the person doing the reconciling understand the differences between the sources and make tradeoffs between standardization and inclusiveness. Beginning with understanding the differences between the sources, when there are discrepancies between subtrees of a vendor's product catalog, the discrepancies may represent different language choices of different catalogers on different days, temporary gaps in the product line, or logical differences in the two contexts (say, men's vs. women's clothing). When we are integrating models from independent sources (for example, from two different vendors' catalogs) there are likely to be even more discrepancies of this kind. We need a technique that makes it possible to deal with such discrepancies quickly and in a uniform manner.

[0017] In making the tradeoff between standardization and inclusiveness, current technologies allow only two approaches: either a “one size fits all” approach which requires that each source of metadata conform to a single set of categories or a “kitchen sink” approach which takes the union of all the categories represented by all the sources. Hybrid approaches, like a fixed standard or “generic” model which defers to local models for any non-common sources, do not escape the problems of the two basic approaches.

[0018] There are numerous drawbacks to each approach.

[0019] For the standard model approach:

[0020] Standard sets of categories are often strongly resisted by different stakeholders in the business context; this is the case both within the enterprise, as in efforts at knowledge dissemination and centralized knowledge sharing, and in cross-enterprise contexts like business-to-business (B2B) E-commerce.

[0021] Where standard sets of categories can be adopted, there must be a design process for creating the standards; and this process, if not simple creation of categories by fiat, must involve some systematic study of candidate sources to synthesize a standard.

[0022] Once the standard set of categories is designed, each metadata source must do an initial conversion of its material to fit that standard. This is also an effort- and knowledge-intensive process.

[0023] Unless the independent information sources convert their own catalogs to the single standard, additional work will need to be done every time new inventory is made accessible via the standardized categories. The problem becomes more acute when there is a need to evolve the separate categories by, say, adding new lower-level categories. These must somehow be reconciled with the standard.

[0024] For the “kitchen sink” union of all local metadata approach:

[0025] The solution winds up with many spurious duplicate categories in the main model. Some might represent true duplicates, others might represent homonyms or categories from different sources with a common name but different interpretation.

[0026] Even where the categories have distinct names, the converse problem exists. Sometimes the different names represent significant differences in the categorized content; other times, the different names are names used in different contexts for similar items.

[0027] Interpreting these connections between categories is certainly difficult to do. But if it is not done when the main catalog is made, we simply burden the user of the main catalog with the work. The user will need to make these interpretations every time a search for a specific item is performed, the user will have to make the interpretations without the catalogers' knowledge, and will have to do this even though the correct interpretations change slowly, if at all. So the “union” solution is in effect a non-solution that leaves the user to deal with problems that should have been solved by the catalogers.

[0028] The union approach does not create categories that provide access, via a single query, to content from multiple original sources. For example, if clothing from different catalogs were integrated in this way, “Women's Garments” would be listed from Merchandiser A, “Women's Clothes” from Merchandiser B. No category would show both in a common query. Even if the system supported queries on multiple categories simultaneously (effectively, union vs. intersection operations) to return the content classified in multiple categories within a single list, the user would still need to know that these two categories were the ones to select. Of course, the more sources there are, the greater the burden on the user to select the proper categories.

[0029] There are some obvious other operational drawbacks, such as: large relative size of the common “kitchen sink” catalog; sensitivity of that catalog's look and feel to local changes made by separate catalog sources (either new categories get migrated in, changing the main catalog; or they don't and they have no connectivity to the main catalog); and the fact that, to disambiguate the categories, the typical trend will be towards explicit inclusion of the information source as part of the category (e.g., “Merchandiser A Women's Clothes”, “Merchandiser B Women's Garments”). At this point the union catalog is providing little value other than a single point of access for multiple collections. No real semantic integration has been performed.

[0030] For both the approaches listed above, there is another serious drawback. So far we have discussed the integration problem from the point of view of the information source, e.g., a merchandiser's catalog. But, particularly in a B2B context, the consumer, procurement or demand side will have the same needs for customized categories to streamline repeated buying decisions. Neither the “one size fits all” nor the “kitchen sink” approach provides any means to support customer-specific views into the category system of this kind.

[0031] The reason that present approaches fail to address so many of these problems is that they all attempt to coordinate multiple information sources using an information representation that is no more powerful than the one used in the sources. For instance, we cannot reconcile discrepancies among several taxonomies with a single taxonomy without resorting to either one or the other of “one size fits all” and “kitchen sink”. The solution is to migrate to a richer semantic framework such as the one provided by the Ariadne system. To make the migration, techniques are needed for transforming existing information sources into richer semantic frameworks. Providing such techniques is an object of the present invention.

SUMMARY OF THE INVENTION

[0032] The techniques automate the operation of combining models. With the techniques, systems can be built which permit a user to easily and efficiently produce a constellation of factored models from one or more source models. The constellation of factored models includes a composite model in which common aspects of the source models are combined and a variability model which contains the differences between the models. The constellation loses none of the information of the source models and allows the information of the source models to be accessed in ways not possible with the source models.

[0033] A key technique in making the composite model is unifying analogous nodes of the source models in the composite model. The technique automatically correlates child nodes belonging to parent nodes from one or more graphs other than the composite model with each other and with any child nodes of a parent node in the composite model and then displays the correlations in a user interface. A user then indicates whether he or she takes the correlated nodes to be analogous; if they are and they are not correlated with a child node that is already in the composite model, child nodes corresponding to the correlated nodes are added to the child nodes of the composite model's parent node. The above technique can be used with input trees from the source models; in this case, the technique can be employed recursively to unify analogous nodes at all levels of the input trees.
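One level of the unification just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the preferred embodiment: the function names correlate and unify_level are hypothetical, correlation is assumed to be a simple match on a normalized node name, and the confirm callback stands in for the user's decision in the user interface.

```python
def correlate(children_by_source):
    """Group child node names from several parents by a normalized name.

    Assumption: the correlated property is the node's name, compared
    case-insensitively. Real correlation criteria may differ.
    """
    groups = {}
    for source, children in children_by_source.items():
        for name in children:
            groups.setdefault(name.strip().lower(), []).append((source, name))
    return groups


def unify_level(children_by_source, composite_children, confirm):
    """Add confirmed analogous nodes to the composite parent's children.

    confirm(members) is a hypothetical stand-in for the user's answer;
    nodes that are uncorrelated or rejected are returned as anomalous.
    """
    anomalous = []
    for key, members in correlate(children_by_source).items():
        if len(members) > 1 and confirm(members):
            if key not in composite_children:  # not already in the composite
                composite_children.append(key)
        else:
            anomalous.extend(members)
    return composite_children, anomalous


composite, anomalous = unify_level(
    {"A": ["Shirts", "Shoes", "Hats"], "B": ["shoes", "Shirts", "Belts"]},
    composite_children=[],
    confirm=lambda members: True,  # the "user" accepts every correlation
)
```

Applying the same function to the children of each confirmed pair of nodes would give the recursive, level-by-level traversal mentioned above.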

[0034] Another key technique in making the composite model is using the structure of the source models to determine whether a node from one of the source models is correlated with a node from another of the source models. In this technique, the first node's relationship to at least another node in its source model is analyzed to produce a first result and the second node's relationship to at least another node in its source model is analyzed to produce a second result, and the results are used to determine at least in part whether the first node is correlated with the second node.

[0035] Other objects and advantages will be apparent to those skilled in the arts to which the invention pertains upon perusal of the following Detailed Description and drawing, wherein:

BRIEF DESCRIPTION OF THE DRAWING

[0036] FIG. 1 illustrates how graphs may be used to show relationships among entities;

[0037] FIG. 2 shows a complex model;

[0038] FIG. 3 shows how the concepts of a model are related to instances and agents;

[0039] FIG. 4 shows the structures that represent model types, models, concepts, and instances in a preferred embodiment;

[0040] FIG. 5 is an overview of a system in which models and model types are implemented;

[0041] FIG. 6 is an overview of views and viewers in the system of FIG. 5;

[0042] FIG. 7 shows a user interface for defining a new model;

[0043] FIG. 8 shows a user interface for defining a root concept;

[0044] FIG. 9 shows a user interface for adding a subclass concept to a model of the taxonomy type;

[0045] FIG. 10 shows a user interface for adding an instance to a concept of a model;

[0046] FIG. 11 shows a user interface for adding a referent to an instance;

[0047] FIG. 12 shows a user interface for displaying a model;

[0048] FIG. 13 shows an example CFU transform;

[0049] FIG. 14 is a conceptual diagram of the simplest CFU transform;

[0050] FIG. 15 is a conceptual diagram of two more complex CFU transforms;

[0051] FIG. 16 is a diagram of a graphical user interface for matching concepts;

[0052] FIG. 17 is a high-level flowchart of a procedure for making a CFU transform;

[0053] FIG. 18 is a block diagram of a system for making a CFU transform;

[0054] FIG. 19 is a flowchart of the factor_models agent in a preferred embodiment;

[0055] FIG. 20 is a flowchart of the recursive factor-models-fn function in a preferred embodiment;

[0056] FIG. 21 shows a first two windows from the user interface employed in a preferred embodiment;

[0057] FIG. 22 shows a second two windows from the user interface;

[0058] FIG. 23 shows another window from the user interface;

[0059] FIG. 24 shows an example of operation of the CFU transform;

[0060] FIG. 25 shows CFU transforms involving correlations at different levels of the input trees;

[0061] FIG. 26 shows another CFU transform involving correlations at different levels of the input trees;

[0062] FIG. 27 shows a CFU transform involving multi-level factoring;

[0063] FIG. 28 shows how the CFU procedure might deal with the problem of FIG. 27;

[0064] FIG. 29 shows possible solutions of the problem of FIG. 27;

[0065] FIG. 30 shows another possible solution of the problem of FIG. 27; and

[0066] FIG. 31 shows how an anomalous concept may be dealt with.

[0067] Reference numbers in the drawing have three or more digits: the two right-hand digits are reference numbers in the drawing indicated by the remaining digits. Thus, an item with the reference number 203 first appears as item 203 in FIG. 2.

DETAILED DESCRIPTION

[0068] The first part of the Detailed Description is an overview of the Ariadne system from PCT/US00/01042; the description of the techniques for conceptual factoring and unification employed in the Ariadne system begins with the section Conceptual factoring and unification.

[0069] Using Graphs to Specify Multiple Aspects of a Collection of Data: FIG. 1

[0070] For purposes of the following informal discussion, the term graph is used in the sense of a set of points where at least one of the points is connected to itself or another point by an arc. The points are termed the vertices of the graph and the arcs are termed its edges. In the graphs used in the invention, the vertices represent entities such as concepts and the edges represent relationships between the concepts. In FIG. 1, graphs are used to represent a taxonomy 101 of concepts relating to clothing. The concepts belonging to a given taxonomy are related to each other in both a top-down fashion, i.e., from the most general concept to the least general concept, and a bottom-up fashion, i.e., from the least general concept to the most general. In the top-down relationship, the concepts are related as class and subclass; for example, in taxonomy 101, footwear is a subclass of clothing and insulated boots is a subclass of footwear. The bottom-up relationship is termed an is a relationship, i.e., insulated boots is one of the concepts of footwear and footwear is one of the concepts of clothing.

[0071] Thus, in taxonomy 101, each vertex 103 represents a concept relating to clothing, and edges 105 connect the vertices 103. The arrowhead on the edge indicates the direction of the relationship. There are two graphs in FIG. 1; one graph, indicated by dashed straight lines 107, indicates the subclass relationships between the concepts represented by the vertices; the other graph, indicated by solid arcs 109, indicates the is a relationships. Thus, graph 107 shows that outerwear 113 and footwear 115 are subclasses of clothing 111 and parkas 117 and raingear 119 are in turn subclasses of outerwear 113. Further, as shown by solid arcs 109, sandals 121 has an is a relationship to footwear 115, footwear 115 has an is a relationship to clothing 111, and so forth for the other concepts. Each concept has a solid arc 109 pointing to itself because each concept is itself, and therefore has an is a relationship with itself.

[0072] Subclass graph 107 and is a graph 109 thus organize the set of clothing concepts in FIG. 1 according to two aspects: a subclass aspect and an is a aspect. Subclass graph 107 tells us that outerwear 113 has two subclasses: parkas 117 and raingear 119; is a graph 109 tells us that outerwear 113 is clothing 111. Graphs 107 and 109 make it possible to consider any concept in taxonomy 101 from the point of view of its subclass relationships to other concepts and from the point of view of its is a relationships to other concepts. The operation of considering an entity in taxonomy 101 first as it belongs to one of the graphs and then as it belongs to another of the graphs is termed pivoting. The concepts of FIG. 1 can of course have relationships other than those of taxonomy 101, and those relationships, too, can be represented by graphs made up of concepts belonging to the set shown in FIG. 1 and edges connected to them. Each such graph organizes the set of clothing concepts according to another aspect, and pivoting permits a given concept to be seen according to any of the aspects represented by any of the graphs that the concept belongs to.
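The two graphs of FIG. 1 and the pivoting operation may be illustrated with a short sketch. The edges are transcribed from the description above; the dictionary layout and the names SUBCLASS, build_is_a, and pivot are hypothetical, and deriving the is a edges by reversing the subclass edges and adding a self-arc is an assumption made for brevity, not a statement of how the system stores its graphs.

```python
# Subclass graph 107: each concept maps to its direct subclasses.
SUBCLASS = {
    "clothing": ["outerwear", "footwear"],
    "outerwear": ["parkas", "raingear"],
    "footwear": ["sandals", "insulated boots"],
}


def build_is_a(subclass):
    """Build direct is a edges (graph 109): a self-arc plus the parent."""
    is_a = {}
    for parent, children in subclass.items():
        is_a.setdefault(parent, [parent])
        for child in children:
            is_a.setdefault(child, [child]).append(parent)
    return is_a


IS_A = build_is_a(SUBCLASS)


def pivot(concept):
    """View one concept first through one graph, then through the other."""
    return {
        "subclass": SUBCLASS.get(concept, []),
        "is a": IS_A[concept],
    }


view = pivot("outerwear")
```

Here view holds both aspects of outerwear: its subclasses parkas and raingear, and its is a relationships to itself and to clothing.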

[0073] Models and Facets: FIG. 2

[0074] Taxonomy 101 is of course only one of many possible ways of organizing the set of concepts shown in FIG. 1. In the following discussion, a particular way of organizing a set of concepts or other entities is termed a model. Thus, in FIG. 1, the concepts are organized according to a taxonomy model. As we have seen, when concepts are organized in this fashion, the relationships between them are shown by two graphs: subclass graph 107 and is a graph 109; each of these graphs is termed a facet of the model; thus the taxonomy model of FIG. 1 has a subclass facet 107 and an is a facet 109. The pivoting operation permits a concept in the set to be considered according to each of the facets that the concept belongs to.

[0075] The model of FIG. 1 is simple, i.e., it is a single taxonomy. A model may, however, also be complex, i.e., composed of two or more models. FIG. 2 shows such a complex model 201. In FIG. 2, the set of concepts of FIG. 1 has been expanded so that the items of clothing can be organized according to the season they are appropriate for. The new concepts represent the five seasons of the New England climate: winter 205, mud season 206, spring 213, summer 207, and fall 215. The set of concepts shown in FIG. 2 is organized according to complex model 201, which in turn is made up of two simple models. Clothing taxonomy model 209 is the taxonomy model shown in FIG. 1; seasonal clothing model 211 is a model of type simple graph which relates concepts representing clothing to concepts representing the five New England seasons. The facets of model 211 relate a season concept to clothing concepts for the kinds of clothing worn in the season and a clothing concept to the seasons in which the clothing is worn. The concepts parkas 117, raingear 119, sandals 121, and insulated boots 123 belong to both models. Considered as part of clothing model 209, sandals 121 is a subclass of footwear 115; considered as part of the seasonal clothing model, sandals 121 is related to the seasons in which sandals are worn, namely spring, summer, and fall. Outerwear 113, on the other hand, belongs only to clothing model 209, while winter 205 belongs only to seasonal clothing model 211.

[0076] Complex models permit additional operations. For instance, pivoting may be used with complex model 201 to consider a given concept according to each facet of each of the models the concept belongs to. For example, the concept sandals may be considered on the one hand as it is related to the concepts of clothing model 209 and on the other as it is related to the concepts of seasonal clothing model 211. Moreover, since each model organizes the concepts in different ways, the models define different sets of concepts and set operations such as union, intersection, difference, and set xor may be applied.
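The set operations mentioned above can be illustrated with the concept sets of the two simple models of FIG. 2. This is only a sketch: the variable names are hypothetical, and it is assumed that clothing 111 and footwear 115, like outerwear 113, belong only to the clothing model.

```python
# Concepts of clothing taxonomy model 209.
clothing_model = {
    "clothing", "outerwear", "footwear",
    "parkas", "raingear", "sandals", "insulated boots",
}

# Concepts of seasonal clothing model 211.
seasonal_model = {
    "winter", "mud season", "spring", "summer", "fall",
    "parkas", "raingear", "sandals", "insulated boots",
}

all_concepts = clothing_model | seasonal_model     # union
shared = clothing_model & seasonal_model           # intersection
only_clothing = clothing_model - seasonal_model    # difference
in_one_model = clothing_model ^ seasonal_model     # set xor
```

The intersection contains exactly the four concepts said above to belong to both models, and the difference contains the concepts that belong only to the clothing model.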

[0077] Model Types

[0078] Any set of entities which belongs to a taxonomy can be organized by means of a taxonomy model like model 209. Just as all taxonomies are alike in how they organize the entities that belong to them, any taxonomy model will have an is a facet and a subclass facet and similar relationships will exist between the entities belonging to a given facet. Moreover, any user of a taxonomy model will want to perform similar operations using the taxonomy. For example, a user will want to display all of the concepts that are subclasses of a given concept or all of the concepts that a given concept has an is a relationship with. One can thus speak of the taxonomy model type, and all other models will similarly belong to model types. As with models, a model type may be either simple or complex. Because all models belonging to a given model type have similar operations, it is possible to define those operations for the model type and make them automatically available for any model of the type.

[0079] In the present invention, users of the invention may define their own model types or use model types defined by others. A model type is defined as follows:

[0080] a facet specifier specifies each of the facets belonging to models of the type;

[0081] within each facet specifier, a relation specifier specifies how entities joined by an edge of the facet are related;

[0082] propagation specifiers are provided for the facets and/or the entire model; a propagation specifier specifies how operations belonging to models having the model type are performed.

[0083] The model type for the taxonomy model thus has a subclass facet specifier for the subclass facet and an is a facet specifier for the is a facet. The relation specifier for the subclass facet specifies that the subclass relationship is transitive, non-reflexive, and non-symmetric. The fact that the relationship is transitive means that if entity B is a subclass of entity A and entity C is a subclass of entity B, then entity C is a subclass of entity A, or in terms of FIG. 1, that parkas 117 is a subclass of clothing 111. The fact that the subclass relationship is non-reflexive means that an entity cannot be a subclass of itself (which is why there are no edges of subclass graph 107 connecting an entity to itself). The fact that the relationship is non-symmetric means that if entity B is a subclass of entity A, entity A cannot be a subclass of entity B or in terms of FIG. 1, if parkas 117 is a subclass of outerwear 113, outerwear 113 cannot be a subclass of parkas 117. The relation specifier for the is a facet specifies that the is a relationship is transitive, reflexive, and non-symmetric. Thus, as shown in FIG. 1, parkas 117 is itself as well as outerwear and clothing, but if parkas are outerwear, then outerwear cannot be (just) parkas.
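The three properties named by the relation specifiers can be checked mechanically. The following sketch is illustrative only: the function names are hypothetical, a pair (x, y) is assumed to mean "x is a subclass of y", "non-symmetric" is read as "no two distinct entities are related in both directions", and the transitive closure computed here is the relation implied by the direct edges of FIG. 1, not necessarily the set of edges a facet stores.

```python
from itertools import product

# Direct subclass relationships from the description of FIG. 1.
DIRECT = {
    ("outerwear", "clothing"), ("footwear", "clothing"),
    ("parkas", "outerwear"), ("raingear", "outerwear"),
    ("sandals", "footwear"), ("insulated boots", "footwear"),
}
ENTITIES = {entity for pair in DIRECT for entity in pair}


def transitive_closure(edges):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(edges)
    while True:
        extra = {(a, d) for (a, b), (c, d) in product(closure, closure)
                 if b == c} - closure
        if not extra:
            return closure
        closure |= extra


def is_transitive(rel):
    return all((a, d) in rel
               for (a, b), (c, d) in product(rel, rel) if b == c)


def is_reflexive(rel, entities):
    return all((e, e) in rel for e in entities)


def is_non_symmetric(rel):
    return all((b, a) not in rel for (a, b) in rel if a != b)


SUBCLASS = transitive_closure(DIRECT)
IS_A = SUBCLASS | {(e, e) for e in ENTITIES}  # is a adds the self-arcs
```

Under these assumptions SUBCLASS contains ("parkas", "clothing") and no pair relating an entity to itself, while IS_A differs from it only in the self-arcs.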

[0084] The relation specifiers are used to define procedures for adding concepts to models belonging to the class. For instance, if new concepts, say swimwear, bathing suits, and wetsuits are added to the model of FIG. 1, with swimwear being a subclass of clothing and bathing suits and wetsuits being subclasses of swimwear, the relation specifiers will ensure that there are edges in the subclass facet connecting clothing to swimwear and swimwear to bathing suits and wetsuits, but no edges in the subclass facet connecting clothing to wetsuits or bathing suits to wetsuits, and will similarly ensure that there are edges in the is a facet connecting each of the new concepts to itself and wetsuits and bathing suits to swimwear and swimwear to clothing, but no edges connecting wetsuits and bathing suits to clothing and none connecting wetsuits and bathing suits to each other.

[0085] One example of a propagator for a taxonomy is a subclass display propagator that displays all of the subclasses belonging to a class. The subclass display propagator works by simply following the subclass facet beginning at the specified class. Thus, if the class is clothing, the display propagator will display outerwear 113, parkas 117, raingear 119, footwear 115, sandals 121, and insulated boots 123. Another example is an is a display propagator that displays the concepts that the specified concept belongs to. This propagator simply follows the is a facet beginning at the specified concept. Thus, for sandals 121, it will display sandals 121, footwear 115, and clothing 111.
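The two display propagators may be sketched as simple traversals. The names below are hypothetical, the facets are assumed to be stored as direct edges in dictionaries, and a depth-first walk is assumed because it yields the order in which the concepts are listed above; the actual propagator mechanism may differ.

```python
# Subclass facet: each concept maps to its direct subclasses.
SUBCLASS = {
    "clothing": ["outerwear", "footwear"],
    "outerwear": ["parkas", "raingear"],
    "footwear": ["sandals", "insulated boots"],
}

# Is a facet without the self-arcs: each concept maps to its direct parent.
PARENT = {
    child: parent
    for parent, children in SUBCLASS.items()
    for child in children
}


def subclass_display(concept):
    """Follow the subclass facet downward, depth first."""
    found = []
    for child in SUBCLASS.get(concept, []):
        found.append(child)
        found.extend(subclass_display(child))
    return found


def is_a_display(concept):
    """Follow the is a facet upward, starting with the concept itself."""
    found = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        found.append(concept)
    return found


below_clothing = subclass_display("clothing")
above_sandals = is_a_display("sandals")
```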

[0086] Relating Concepts to the World: FIG. 3

[0087] In order to be useful, the cards in a library card catalog relate the concepts used in the catalog to books in the library. The same is true with concepts organized by models. In order for the concepts to be useful, they must be related to entities that are examples of the concepts. In the invention, an entity that is or may be an example of a concept is termed an instance, and an instance that is an example of a concept is termed an instance of the concept. It should be pointed out here that one of the things which may be an example of a concept is a model, and thus, an instance may be a model. Using models as instances in other models is one way of making complex models.

[0088] The set of all of the instances available to a system in which the invention is implemented is termed the world of the system. In general, one makes a model to deal with a given area from several aspects, and this area is termed the model's subject. For example, the subject of model 209 is clothing and all of the instances of its concepts represent items of clothing. One thus makes a model for a subject and then relates the model to instances in the world that are relevant to the model's subject. The instances in the world that are relevant to a given subject are termed the subject's collection.

[0089]FIG. 3 shows how concepts are related to instances in a preferredembodiment. FIG. 3 shows a set 301 of instances representing objectsaccessible to the system upon which model 209 is being used. This set301 is termed herein the world of the model. The subject of model 209 isclothing; in FIG. 3, instances belonging to clothing's collection aresurrounded by a curve, as shown at 306. Thus, in FIG. 3, model 209 isbeing applied to world 301, but the instances with which it is actuallyconcerned belong to clothing collection 306. Item instances in clothingcollection 306 are consequently termed clothing instances 307. Theinstances in clothing collection 306 with which model 209 is concernedall represent items of clothing or agents, as will be explained below;however, other instances in clothing collection 306 may representmodels. Of course, more than one set of concepts may apply to a subjector a world and a given set of concepts may be applied to differentsubjects or worlds.

[0090] There are two kinds of instances in world 301: item instances303, which represent items, including other models, that may be relatedto concepts, and agent instances 304, which represent programs that areexecuted by models in response to the occurrence of events such as theaddition of a concept to the model or a request by a user to view itemsbelonging to a given concept. While the program represented by an agentmay be any program at all, the program executes in the context of themodel and can thus take advantage of the model's facets and propagators.In effect, the operations defined for the model are available to agentsin the same fashion that programs belonging to run-time libraries areavailable to application programs.

[0091] The mechanism by which an item instance 303 or an agent instance 304 is related to a concept is an instance facet 309. There is an instance facet 309 for each instance that is related to a given concept. Thus, instance facets relate clothing instances 307(b and c) to concept 121. Of course, an instance may have instance facets connecting it to more than one concept and even to concepts belonging to different models. Generally, the item represented by an instance has another representation, termed an object, in the computer system. What kind of object an instance represents will depend on the application for which the invention is being used. For example, the clothing instances might represent database identifiers of rows describing products in a database table describing a clothing company's products or they might be URLs of Web pages describing the products.

[0092] Propagators may work on instances as well as concepts. For example, a propagator may be defined for the taxonomy model type which retrieves all of the instances associated with a concept and its subclasses. It does so by first following the instance facets for the concept and retrieving all of the concept's instances. Then it follows subclass facet 107 from the concept to its subclasses, their subclasses, and so on down to concepts which have no subclasses. At each concept, the propagator retrieves the instances associated with the concept. Thus, in FIG. 3, when the propagator is applied to concept 115, it will retrieve the clothing instances 307 labeled a, b, c, and d in collection 306.
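The traversal performed by such a propagator can be illustrated with a minimal Python sketch. The class name Taxonomy, its method names, and the assignment of instance d to insulated boots are assumptions made for illustration only; the sketch is not the preferred embodiment's implementation.

```python
from collections import defaultdict


class Taxonomy:
    """Hypothetical stand-in for a model of the taxonomy type."""

    def __init__(self):
        # Subclass facet: concept name -> names of its direct subclasses.
        self.subclasses = defaultdict(list)
        # Instance facets: concept name -> instances directly related to it.
        self.instances = defaultdict(list)

    def add_subclass(self, parent, child):
        self.subclasses[parent].append(child)

    def add_instance(self, concept, instance):
        self.instances[concept].append(instance)

    def retrieve_instances(self, concept):
        """Collect the concept's own instances, then those of every
        subclass reachable along the subclass facet."""
        found = list(self.instances.get(concept, []))
        for child in self.subclasses.get(concept, []):
            for instance in self.retrieve_instances(child):
                if instance not in found:
                    found.append(instance)
        return found


# Footwear concept with two subclasses; instance placement for "d" is assumed.
model = Taxonomy()
model.add_subclass("footwear", "sandals")
model.add_subclass("footwear", "insulated boots")
model.add_instance("footwear", "a")
model.add_instance("sandals", "b")
model.add_instance("sandals", "c")
model.add_instance("insulated boots", "d")
footwear_instances = model.retrieve_instances("footwear")
```

Applied to the footwear concept, the sketch gathers the instance attached directly to footwear first and then the instances attached to each subclass.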

[0093] One agent instance is shown in collection 306: the instance for refinement agent 308. Refinement agent 308 is executed when a concept representing a new subclass is added to model 209. For example, in model 209 as shown in FIG. 1, the concept footwear 115 has two subclasses: sandals 121 and insulated boots 123. Instances which belong to neither of those subclasses belong to footwear. One such instance, 307(a), is shown in FIG. 3. The instance represents gardening clogs. Now, the user of the model is planning to sell more kinds of clogs and consequently decides to add the concept clogs as a subclass of footwear. When that is done, instance 307(a) should become an instance of clogs rather than an instance of footwear. This process of moving an instance into the proper subclass concept is termed refinement, and refinement agent instance 308 automatically does refinement whenever a subclass concept is added to model 209.

[0094] In FIG. 3, refinement agent instance 308 is shown attached to clothing concept 111 and to footwear concept 115. Clothing concept 111 is the broadest concept in the model and is termed the root concept of the model. Of course, every model of type taxonomy has a root concept. In models of the taxonomy type, an agent attached to a concept propagates along subclass facet 107; thus, any concept which is a subclass inherits the agent. Consequently, each concept in model 209 has its own copy of refinement agent instance 308. In FIG. 3, only the copies for clothing 111 and footwear 115 are shown. Since each concept has its own copy of refinement agent instance 308, execution of the agents can be done in parallel.

[0095] When the user adds the new subclass clogs to footwear 115, that event causes refinement agent instance 308(k) to execute. The program follows the subclass facet to the new subclass concept clogs and examines it to determine whether any of the item instances that are related to it are also related to footwear 115. One such item instance, garden clogs, is, and the program rearranges the instance facets 309 so that there is now an instance facet relating clogs to garden clogs, but no longer an instance facet relating footwear to garden clogs. As can be seen from the foregoing, an agent, while user-defined, operates within the context of the environment provided by the model and takes advantage of the operations defined for the model's type.
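The rearrangement of instance facets described above can be sketched in Python as follows. The function name refine and the representation of item facets as a dictionary of sets are assumptions for illustration; an actual agent would use the operations defined for the model's type.

```python
def refine(item_facets, parent, new_subclass):
    """Hypothetical refinement step: any instance related to both the
    parent and the new subclass keeps only its facet to the subclass.

    item_facets maps a concept name to the set of item instances
    related to it (an assumed, simplified representation).
    Returns the set of instances that were moved.
    """
    parent_items = item_facets.get(parent, set())
    subclass_items = item_facets.get(new_subclass, set())
    shared = parent_items & subclass_items
    # Drop the parent's facet for every shared instance.
    item_facets[parent] = parent_items - shared
    return shared


# Garden clogs is related to footwear and, after the user adds the new
# subclass, to clogs as well.
facets = {"footwear": {"garden clogs"}, "clogs": {"garden clogs"}}
moved = refine(facets, "footwear", "clogs")
```

After the call, garden clogs is related only to clogs, which mirrors the outcome described for refinement agent instance 308(k).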

[0096] Representing Models, Concepts and Instances: FIG. 4

[0097] FIG. 4 shows at 401 how the representations of model types, models, concepts, and instances are structured in a preferred embodiment. In overview, as shown by the arrows in FIG. 4, each model definition 413 refers to a model type definition for its model type and to a set of node structures. Some of the node structures represent concepts belonging to the model and others represent instances of the concepts. Each concept node 425 refers to its model and each instance node 437 refers to the concepts the node is an instance of. There may be many models of a given model type, a given model may have many concepts, a given concept may have many instances and a given instance may be an instance of many concepts. A model type definition may thus be located from any model definition of its type, a model definition may be located from any of its concepts, and a concept may be located from any of its instances.

[0098] Continuing in more detail, model type definition 403 includes the model type's name 405, a description 407 of the model type, a facet specifier list 409 that specifies the kinds of facets that models of the type have, and a propagator list 411 that specifies the propagators for models of the type.

[0099] Model definition 413 includes the model's name and description at 415 and 417, a list 419 of the concept and instance nodes in the model, a facet list 421 showing how the model's nodes are related by each facet of the model, and a model type name 423, which refers back to the model type definition 403 for the model.

[0100] Concept node 425 includes the concept's name and description at 427 and 429, a property list 431, which is a list of user-defined properties of the concept, and an attribute list 433, which is a list of attributes for the concept. Each attribute specifies the name of a facet to which the concept node belongs and the name of the node which is the next neighbor of the concept node in the facet. The facets, and correspondingly, the attributes may be subdivided into model facets, which specify facets whose vertices are made up only of concepts of the model, and instance facets, which specify facets connecting concepts and instances. What kinds of model facets a model has is determined by its model type; in a preferred embodiment, there are three kinds of instance facets that run from the concept to an instance:

[0101] item facets, which connect a concept to an item instance representing an item that belongs to the concept;

[0102] exhibitor facets, which connect a concept to an item instance representing an item that possesses a property specified by the concept; and

[0103] action facets, which connect a concept to an agent instance.

[0104] Exhibitor facets are used to deal with concepts like color. A blue clog, for example, exhibits the property of being blue and would therefore be connected to a concept representing the color blue by an exhibitor facet. Owning model 435, finally, refers to model definition 413 for the model the concept belongs to.

[0105] Instance node 437, finally, has an instance name 439, an instance description 441, and a property list 443 for the instance. Included in property list 443 is referent 445, which specifies how to locate the object represented by instance node 437. What the referent is depends on what kind of object the instance node represents. For example, if the instance node represents a Web page, the referent will be the page's URL; if it represents an agent, it may be a pathname for the agent's code; if it represents another model, the referent will be the model's name. Attribute list 447, finally, specifies the instance facets that run from the instance to the concepts it belongs to. There is one such facet corresponding to each of the instance facets running from the concept to the instance. Each of these facets is termed the dual of the corresponding facet. Thus, the item of facet is the dual of the item facet; exhibitor of is the dual of the exhibitor facet; and action of is the dual of the action facet.

[0106] Applying all of the foregoing to concept 115 of model 209, we see that concept node 425 for that concept has model attributes for the subclass facet for concepts 121 and 123 and for the is a facet for itself and for concept 111, an item instance attribute for clothing instance 307(a), and an action instance attribute for refinement agent instance 308(k). Instance node 437 for clothing instance 307(a) has an item of instance attribute for concept 115 and the instance node for refinement agent instance 308(k) has an action of attribute for concept 115.

[0107] In a preferred embodiment, the structures that make up the components of a model are all linked by name, and hash functions and hash tables are used to relate names in the structures to the locations of the structures in memory. For example, to find a concept instance, the preferred embodiment takes the name and presents it to a hash function, which hashes the name to obtain an index of an entry in a hash table and uses the index to find the entry for the name in the hash table; that entry contains a pointer to the location of the concept instance. In other embodiments, other techniques such as pointers might be used to link the components of the structures 401 that represent a model.
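As an illustration only, the name-linked structures of FIG. 4 might be sketched in Python as below. A Python dict serves as the hash table; the field names, the single shared registry, the example URL, and the helper model_type_of_instance are all assumptions introduced for the sketch, and names are assumed to be unique across all structures.

```python
from dataclasses import dataclass, field


@dataclass
class ModelTypeDefinition:          # cf. 403
    name: str
    description: str = ""
    facet_specifiers: list = field(default_factory=list)
    propagators: list = field(default_factory=list)


@dataclass
class ModelDefinition:              # cf. 413
    name: str
    model_type_name: str            # refers back to the model type by name
    description: str = ""
    node_names: list = field(default_factory=list)
    facets: dict = field(default_factory=dict)


@dataclass
class ConceptNode:                  # cf. 425
    name: str
    owning_model: str               # name of the model definition
    description: str = ""
    properties: dict = field(default_factory=dict)
    attributes: list = field(default_factory=list)  # (facet name, neighbor name)


@dataclass
class InstanceNode:                 # cf. 437
    name: str
    description: str = ""
    properties: dict = field(default_factory=dict)  # holds the referent
    attributes: list = field(default_factory=list)  # dual facets, e.g. "item of"


# The dict plays the role of the hash table relating names to structures.
registry = {}


def register(structure):
    registry[structure.name] = structure
    return structure


register(ModelTypeDefinition("taxonomy", facet_specifiers=["subclass", "is a"]))
register(ModelDefinition("clothing model", model_type_name="taxonomy",
                         node_names=["footwear", "gardening clogs"]))
register(ConceptNode("footwear", owning_model="clothing model",
                     attributes=[("item", "gardening clogs")]))
register(InstanceNode("gardening clogs",
                      properties={"referent": "http://example.com/clogs"},
                      attributes=[("item of", "footwear")]))


def model_type_of_instance(instance_name):
    """Follow names from an instance to its concept, model, and model type."""
    instance = registry[instance_name]
    concept_name = next(n for facet, n in instance.attributes if facet == "item of")
    concept = registry[concept_name]
    model = registry[concept.owning_model]
    return registry[model.model_type_name]
```

The helper shows how a model type definition can be located from an instance purely by successive name lookups, as the overview of FIG. 4 describes.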

[0108] A System that Uses Models to Organize Information: FIG. 5

[0109] FIG. 5 is an overview of a system 501 that uses models to organize information. The system, called Ariadne, has three major components:

[0110] server 509, which maintains the data structures 401 that implement model types, models, and instances, together with views 513, which provide logical descriptions of models and their parts, but do not specify how the model will appear in a specific GUI;

[0111] a number of viewers 507, which present the contents of the views as required for particular graphical user interfaces (GUIs); and

[0112] ERIS (external resource interface system) 505, which provides access to the systems 503 that contain the objects represented by instances 437.

[0113] Server 509 may be implemented on any kind of computer system, and viewers 507 may be monitors, Web browsers, PCs or other systems that have either local or remote access to the computer system upon which server 509 is implemented. As shown in FIG. 5, the outside systems accessed via ERIS 505 may include relational database systems, with the objects being records or queries, Web servers, with the objects being Web pages, email systems, with the objects being email messages, and systems that use XML as their interface to other systems. The viewers 507 and the components of ERIS 505 interact with the model types, models, agents, views, and instances by way of interfaces 511 defined using Interface Definition Language (IDL).

[0114] An example of how system 501 functions is the following: A user of a viewer 507(i) is interacting with clothing model 209 via a graphical user interface and wishes to see all of the instances of footwear that are currently available in collection 306 of clothing model 209. The user specifies footwear concept 115 and a “display instances” operation. This operation specification arrives via IDL 511 in server 509, and the propagator for the taxonomy model type which retrieves instances retrieves the instances that are related to concepts footwear 115, sandals 121, and insulated boots 123. Ariadne server 509 then typically makes a list of the instances represented by the objects for display in viewer 507(i). If the user of the viewer selects one or more of the instances from the list, Ariadne server 509 provides the referents 445 for the objects represented by the selected instances to ERIS 505, which retrieves the objects referred to by the referents and returns them to Ariadne, which then makes a display using the retrieved objects and sends the display to viewer 507(i). For example, if the clothing instances represent Web pages containing catalog descriptions of the items, when the user of viewer 507(i) selects an item from the list, Ariadne server 509 will provide the URL for the item's Web page to ERIS 505, ERIS 505 will fetch the Web page, and Ariadne server 509 will provide it to viewer 507(i). Ariadne server 509 also provides views 513 which permit a user at viewer 507(i) to define, examine, and modify models. The user interfaces for doing so will be explained in detail later on.

[0115] Details of Views 513: FIG. 6

[0116] FIG. 6 shows details of the implementation of views 513 in a preferred embodiment. Models may have multiple views and views may have multiple presentations. The implementation supports different presentations of the same model concurrently, collaborative modeling and real time knowledge sharing, and independent yet sharable knowledge explorations.

[0117] In Ariadne, views are implemented in a subsystem known as Calyx. Calyx 601 is a CORBA server which exports via IDL specifications an abstract interface for views. Calyx 601 could also be any other distributed middleware server (for example, proprietary RPCs or DCE or possibly DCOM). A view 603 is a collection of bins 605 of information about the target source: a model or a world. Bins hold information such as the current objects being shown, whether the attributes of an object along any given facet are expanded, what facet a bin is looking at, etc. The typical representation 601 of a view is a structure containing (among other things) a container of bins 605.

[0118] All views and bins (as well as any other externally accessible resource) are referenced by opaque IDs which are presented to any viewer 607 logging into Ariadne. A viewer 607 is an active object through which the abstract information is displayed. Each viewer takes the abstract information maintained by Calyx in a view 601 and presents it in a manner which is consistent with the interface requirements and look and feel of a given GUI. For example, a taxonomy might be represented by a graph, an outline, or simply as an indented list of text and the viewer will use whatever resources are provided by its GUI to make the representation. For example, an outline might be presented by a Java Swing tree widget or an MFC tree widget.

[0119] As may be seen from the dashed lines in FIG. 6, a view 601 may be shared by a number of viewers 607. Calyx ensures that all viewers 607 that use a given view 602(i) are synchronized to the most recent changes in view 602(i). When a viewer 607(j) requests Calyx to update or otherwise change part of the view (say, expand a node in a bin), Calyx performs this operation for viewer 607(j) and then asynchronously sends the update information to all other viewers actively using the view in question. These requests by Calyx to such viewers are client requests to server portions in those viewers. Hence, Calyx is a client and the viewers must implement a server interface for these asynchronous updates.

[0120] Calyx also supports (via the model and world infrastructure) various operations on the contents of bins. Specifically, various set operations (union, set difference, intersection, etc.) may be applied to arbitrary sets of bins. Additional operations may be defined by the user. The effect of the set operations is to apply the operation on the sets of information represented in the bins to produce a new bin (called a composition bin) with the computed resulting information. This is then propagated to all connected viewers. Further, bins may be combined in this way to create constraint networks of composition bins. If any bin in the network is changed (manually or via automated updates) the effect is propagated throughout the entire affected subnetwork in which the bin is connected. These propagated results are sent to all viewers via the asynchronous operations described above.
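A minimal Python sketch of composition bins and of propagation through a network of them is given below. The class names Bin and CompositionBin and their methods are assumptions for illustration; notification of viewers is omitted, and the sketch does not represent Calyx's actual interfaces.

```python
class Bin:
    """Hypothetical bin holding a set of items and a list of dependents."""

    def __init__(self, items=()):
        self._items = set(items)
        self._dependents = []

    @property
    def items(self):
        return set(self._items)

    def set_items(self, items):
        # Changing a bin triggers recomputation of every dependent bin.
        self._items = set(items)
        for dependent in self._dependents:
            dependent.recompute()


class CompositionBin(Bin):
    """Bin whose contents are computed from source bins by a set operation."""

    def __init__(self, operation, *sources):
        super().__init__()
        self.operation = operation
        self.sources = sources
        for source in sources:
            source._dependents.append(self)
        self.recompute()

    def recompute(self):
        result = self.operation(*(source.items for source in self.sources))
        self.set_items(result)  # propagates further down the network


# A small constraint network: (a intersect b) union c.
a = Bin({1, 2, 3})
b = Bin({2, 3, 4})
c = Bin({9})
common = CompositionBin(set.intersection, a, b)
combined = CompositionBin(set.union, common, c)
before = combined.items

a.set_items({3, 4})          # a manual change to one bin in the network
after = combined.items
```

Changing bin a causes the intersection bin to be recomputed, and that change in turn reaches the union bin, which corresponds to propagation through the affected subnetwork.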

[0121] Separation of Levels of Information in the Implementation: FIGS. 3-6

[0122] An important characteristic of Ariadne is the manner in which complexity is reduced and flexibility increased by separating various levels of information from each other. One of these is the separation of model types from models, as seen in the separation of model type definition 403 from model definition 413 in FIG. 4. Another is the separation of models from instances, as seen in FIGS. 3 and 4; this permits multiple models to be built independently of each other and yet work over the same world. It also permits models to be reused in different worlds. Yet another is the separation of an instance from the object that it represents, so that the instance serves as a proxy for the object, as seen with regard to referent property 445 in FIG. 4 and the use of ERIS interface 505 to retrieve objects represented by referents from a number of different information sources 503. Then there is the agent/model separation: agents run in the context of models, but they are defined in terms of model types, not the individual models. For example, the refinement agent will work with any model that has the taxonomy type. Finally, as seen in FIGS. 5 and 6, views 601 are separated from models and worlds and viewers 607 are separated from views 601.

[0123] The User Interface for Building, Modifying, and Displaying Models: FIGS. 7-12

[0124] A particular advantage of model types is that they greatly simplify the construction and modification of models. They do so because the part of Ariadne which constructs models can use the information in the model type to automatically place concepts in the proper facets and in the proper locations in those facets and to propagate information provided by the user to the concepts that require it. One example of such propagation is the propagation of the refinement agent from the root of a model of the taxonomy type via the subclass facet to all of the concepts in the model.

[0125] FIG. 7 shows the dialog box 701 used in a preferred embodiment to create a new model. At 703 there appears a list of the presently-available model types; the user has selected simple taxonomy, indicating that the new model is to have the simple taxonomy model type; in the name box, the user has input “usr:Clothing”, indicating that that is to be the name of the new model; at 709, the user may input the description. The result of these inputs is of course the construction of a model definition 413 for the new model, with model name 415 being “usr:Clothing” and model type name 423 being “Simple Taxonomy”. List 705 gives an example of what can be done with models. In Ariadne, models themselves are instances in a model whose concepts are model types; one can thus simply select an already-made model from that model. In instance node 437 for an instance representing a model, referent 445 simply specifies the location of the model's model definition 413. The action model similarly treats agents as instances of a model whose concepts are the model types the agents are written for.

[0126] FIG. 8 shows the dialog box 801 used to add a root concept to the subclasses facet of the new model “Clothing”. At 803 would normally appear the concepts that are presently in the model; the field is empty, as the model as yet has no concepts. At 805, the user writes the name of the root concept, and as before, the user may also add a description. The result of these inputs is the creation of a concept node 425 with the name “Clothing” in field 427 and the model name “usr:Clothing” in field 435. Since “Clothing” is a root concept and there are no other nodes, the taxonomy type requires that there be as yet no subclass attributes in attribute list 433, but a single is a attribute for “Clothing” itself, and Ariadne automatically adds these to “Clothing”'s concept node 425.

[0127] FIG. 9 shows the dialog box 901 used to add subclasses to an existing taxonomy model. Here, the model already has as subclasses of the root concept clothing the concepts accessories, apparel, swimwear, and footwear, and further subclasses are being added to the apparel subclass. At 903, the name apparel of the concept to which subclasses are being added appears; at 904, names of already existing concepts appear; since only the first level of concepts has as yet been defined, the names are those of concepts at the same level as apparel; at 905, finally, is a field for adding a newly-made concept.

[0128] A user may add a subclass either by selecting from among concepts listed in 904 or by using field 905 to add a newly-made subclass. For each newly-made subclass concept that is added, Ariadne creates a concept node 425 with the name of the concept at 427 and the name of the model at 435; for each concept being added as a subclass, Ariadne adds attributes in attribute list 433 for the is a facet specifying the new concept node itself and the concept node for the apparel concept. Ariadne further creates an attribute in attribute list 433 in the concept node for the apparel concept for the subclass facet which specifies the new concept node. Thus, when all of the subclasses have been added, they all belong to the subclass and is a facets in the manner required for the taxonomy model type. It should be pointed out here that if the user attempts to select one of the concepts listed in 904 to be added to apparel, Ariadne will determine from the model type that this is not possible in the taxonomy model type (in a taxonomy, a concept at one level of the taxonomy may not be a subclass of another concept at the same level) and will not add the concept but will indicate an error. In other embodiments, Ariadne may simply not display concepts that cannot be added to the concept selected at 903.

[0129] FIG. 10 shows dialog box 1001 used to relate instances to a concept. Dialog box 1001 has the same form as dialog box 901, with area 903 containing the name of the concept to which the instances are being related, area 905 containing the names of instances that are available to be added to the concept, and field 1007, which can be used to add a newly-made instance. When a newly-made instance is added, an instance node 437 is created for the instance, with the instance's name at 439 and any description provided by the user at 441. For a newly-made or previously-existing instance, an attribute for the item of facet that indicates the concept sweaters is added to the instance node's attribute list 447, and one for the item facet that indicates the instance is added to the concept node's attribute list 433. Similar dialog boxes are used to add agents and items that are exhibitors, with corresponding modifications in the attribute lists of the concept and instance nodes. Ariadne also has a copying interface that can be used to select instances belonging to a concept in one model to become instances of a concept in another. The attribute lists 447 of the instance nodes for the copied instances are modified to add attributes for the instance of facet specifying the concept, and the other concept's attribute list 433 is modified to include attributes for the instance facet for the newly added instances.

[0130] FIG. 11 shows how referent fields 445 are set in instance nodes 437. Window 1101 has three subwindows: two show models that apply to the clothing world: “clothing categories” and “fabrics”. Both models belong to the taxonomy type, and thus both can be displayed as outlines, as shown at 1103. The user wishes to add referents, in this case the URLs of Web pages that show the items represented by the instances, to the instances that belong to the concept “apparel”. In terms of facets, that is all of the instances which have an is a relationship to “apparel”, that is, the instances that are related to “apparel” and all of its subclasses. To perform this operation the user selects “apparel” in outline 1103; Ariadne then uses a propagator for the taxonomy model type to generate the list seen at 1107, which is the list of all of the instances that belong to “apparel” and its subclasses. To assign a URL to an instance, the user writes the URL opposite the instance in field 1109. The URL for a given instance goes into referent 445 in node 437 for the instance.

[0131] FIG. 12 shows how Ariadne displays a model. Model 1201 is a taxonomy of the events handled by Ariadne. The boxes are the model's concepts and the arcs 1203 are the arcs of one of the facets, in this case, the is a facet. Selection of facets to be viewed is controlled by check box 1205; as seen there, model 1201 is to be displayed showing its concepts and its is a facets. More than one facet may be selected, in which case, the arcs for each selected facet are displayed simultaneously.

[0132] Conceptual Factoring and Unification

[0133] This document describes a general graph transformation capability which we call conceptual factoring and unification (CFU). The CFU transform operates on an input model or set of models with highly repetitive or redundant substructure; these repetitive regions are rooted at concepts which are identified by the user in initiating the transform. The transform pulls the common subtrees of models into separate factored models. One model (the composite model or C) represents a kind of normalized template for common parts of the subtrees; the other model (the variability model or V) represents the axes of variability covered by the substructures as a set.

[0134] The terms factoring and unification suggest the dual nature of the transform. On the one hand, it requires splitting or factoring original input model(s) into components representing common and variant aspects of the collections respectively. On the other hand, in particular to create the composite model, it requires comparison and synthesis (or unification) of similar model structures. Furthermore, since models to which the CFU transform may be profitably applied typically categorize analogous but non-overlapping sets of data, the result of the transform is a set of models that provide the ability to treat disparate collections in a unified way.

[0135] The CFU transform is implemented as a procedure which begins with user establishment of the roots of the composite and variability models, continues with user selection of portions of input models which may be unified in the composite model, then employs algorithmic determination of whether concepts are candidates for unification in the composite model, and thereupon uses interactive user verification and/or modification of the results of the algorithmic selection of candidates for unification to allocate concepts to the composite and/or the variability models.

[0136] An Example CFU Transformation: FIG. 13

[0137] FIG. 13 shows a simple CFU transformation 1301. The starting point is model 1302, which represents the e-catalog of a clothing merchant. In the following, models which are the starting points of CFU transformations will be termed source models. The concepts in model 1302 represent categories of clothing. Model 1302 has the taxonomy model type, with clothing as the highest class in the hierarchy of concepts. There are two major subcategories: Women's and Men's, each of which has a subtree of categories. The Women's subtree is labeled 1303 and the Men's subtree is labeled 1305. As one would expect from the general similarity between men's and women's clothing, the categories in the subtrees are closely related and often identical. For example, Outerwear category 1304 in Women's subtree 1303 has the subcategories Raingear, Vests, Parkas, and Jackets, as does Outerwear category 1306 in Men's subtree 1305.

[0138] The result of the transform is constellation 1307, which has two parts, a common factored (C) model 1309 and a variability (V) model 1311. C model 1309 is a Clothing taxonomy model that does not have the Women's and Men's subtrees of the original model 1302, but does have one of every other subcategory of the original model 1302. Thus, instead of two Outerwear subtrees 1304 and 1306, there is a single Outerwear subtree 1310 that contains the categories that belonged to each of subtrees 1304 and 1306. Where two subtrees of model 1302 have different subcategories, C model 1309 includes all of the subcategories. Thus, Apparel subtree 1313 in Women's subtree 1303 and Apparel subtree 1315 in Men's subtree 1305 are identical except that the Women's Apparel subtree 1313 has an additional subcategory, namely Skirts and Dresses. Apparel subtree 1317 in C model 1309 includes Skirts and Dresses as well as the other subcategories of Apparel subtrees 1313 and 1315.

[0139] The fact that there are different kinds of clothing for men and women in most of the categories in C model 1309 is captured by V model 1311. V model 1311 is a taxonomy model that has a topmost category Gender and two subcategories: Men's and Women's. Thus, after applying CFU transform 1301, C model 1309 includes specific clothing categories like Shoes and Gloves, and V model 1311 has concepts for the primary differentiator in the inventory, in this case, the categories Men's and Women's. Not shown in factored and unified model 1307 are facets that connect the instances that represent the actual items of clothing that belong to each category to the C and V models. Each instance is connected by an item facet to the proper category in C model 1309 and by another item facet to the proper category in V model 1311. Consequently, an instance for a pair of men's shorts has one item facet to the category shorts in Apparel 1317 of C model 1309 and another item facet to the category Men's in V model 1311.

[0140] Note that the concept Men's/Shoes of model 1302 appears nowhere in C model 1309 or V model 1311, but we can still obtain the original sets of instances that were associated with this concept by making queries that use concepts selected from both the C model and the V model. For example, to obtain the instances associated with Men's/Shoes in model 1302, one selects instances that belong to the intersection of the set of instances belonging to Shoes in C model 1309 and the set of instances belonging to Men's in V model 1311. This intersection is of course the set of men's shoes. Not only can we make any query that was possible in model 1302, we can also make simple queries on concepts in the C and V models that return result sets not directly obtainable in the original model. For example, the instances obtained from Shoes include a mix of instances that was obtainable from model 1302 only by querying the Men's and Women's sub-trees separately. This ability to access two originally separate collections of content via a single model is one powerful benefit of the CFU technique.
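The intersection query described above can be sketched in a few lines of Python. The instance names and the dictionary representation of item facets are hypothetical and serve only to make the example concrete.

```python
# Item facets of the factored models, as concept -> set of instances.
# All instance names are hypothetical.
c_model = {
    "Shoes": {"mens loafer", "womens pump"},
    "Gloves": {"mens ski glove"},
}
v_model = {
    "Men's": {"mens loafer", "mens ski glove"},
    "Women's": {"womens pump"},
}


def query(c_concept, v_concept):
    """Instances classified under both a C-model and a V-model concept."""
    return c_model[c_concept] & v_model[v_concept]


mens_shoes = query("Shoes", "Men's")   # recovers the old Men's/Shoes extent
all_shoes = c_model["Shoes"]           # a query the source model lacked
```

The first query reconstructs the set formerly attached to Men's/Shoes, while the second returns men's and women's shoes together from a single concept.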

[0141] The CFU technique can also be used in unifying independently developed taxonomies, although in these cases differences between the taxonomies are likely to be noisier and more arbitrary. Suppose we are trying to create a single reseller's or comparison shopping guide's index to two different clothing manufacturers, M. M. Legume and SkyFront. Suppose we are looking at the men's clothes sections of both catalogues. We find “Men's Shirts” and “Shirts for Men”. Are these the same concept? In one sense, they are not, because different instances belong to each concept. In another sense, they are, because the two concepts are analogous. As seen from the example of FIG. 13, what CFU is concerned with is analogous concepts. Because that is the case, CFU employs both automatic processing of concepts and human input to determine whether sets of concepts that appear after the automatic processing to be analogous and therefore candidates for unification really can be unified and also to determine whether sets of concepts that do not appear to be analogous nevertheless can be unified.

[0142] CFU Concept of Operations: FIGS. 14 and 15

[0143] Terminology

[0144] The reader is reminded of the following terminology for the Ariadne system. Details may be found in the discussion of the Ariadne system above. Each Ariadne model is associated with some set of instances in a collection called the world. Included in the instances associated with the model are item instances representing items that are related to the concepts in the model. The set of item instances associated with a concept by means of a particular type of facet is called the concept's extent with regard to that facet; the extent of a model for a particular facet type is the union of the extents for the particular facet type of all concepts in that model. An extensional interpretation of a model's semantics interprets the model in terms of the item instances classified under its concepts. An intensional interpretation of the model's semantics interprets the model in terms of the relationships between the concepts that are defined by the model's facets and/or agents. As is apparent from the foregoing, a concept may also be interpreted intensionally or extensionally. For example, if two concepts in the same model are linked to the same set of instances, the concepts are extensionally equivalent. If two concepts have similar facets connecting them to the same or similar other concepts and are associated with the same agents, the two concepts may be intensionally analogous.
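The notions of extent and extensional equivalence can be illustrated with the following Python sketch. The nested-dictionary representation, the function names, and the sample data are assumptions made for the example.

```python
def concept_extent(model, concept, facet):
    """Instances associated with a concept by one type of facet."""
    return set(model.get(concept, {}).get(facet, ()))


def model_extent(model, facet):
    """Union of the extents, for one facet type, of all concepts in the model."""
    result = set()
    for concept in model:
        result |= concept_extent(model, concept, facet)
    return result


def extensionally_equivalent(model, first, second, facet):
    """Two concepts are extensionally equivalent when their extents coincide."""
    return concept_extent(model, first, facet) == concept_extent(model, second, facet)


# Hypothetical model: concept -> facet type -> instances.
sample = {
    "Men's Shirts": {"item": {"oxford", "polo"}},
    "Shirts for Men": {"item": {"oxford", "polo"}},
    "Gloves": {"item": {"ski glove"}, "exhibitor": {"oxford"}},
}
item_extent = model_extent(sample, "item")
same = extensionally_equivalent(sample, "Men's Shirts", "Shirts for Men", "item")
```

Intensional analogy, by contrast, depends on the facets and agents of the concepts and is not captured by comparing extents alone.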

[0145] In Ariadne, several different models may categorize the same set of objects. When this is the case, it is often useful to identify one of the models as the primary category system for the objects, and to identify other models as descriptions of particular aspects of the objects. We refer to the category model as a concept model for the collection; the others are called feature models. The set of models that, taken together, describe a certain collection of objects is referred to as a model constellation. In FIG. 13, C model 1309 and V model 1311 form a model constellation in which C model 1309 is the concept model and V model 1311 is a feature model.

[0146] In addition to being related by sharing a set of objects, a concept model and a feature model may be related by feature facets that connect concepts of the concept model to concepts of a feature model associated with the concept model in the model constellation. For purposes of the present discussion, a feature facet is any facet created during the execution of the CFU transform which connects concepts in the C model to concepts in the V model, such that something of the intended semantics of the transform is enforced on subsequent changes to the models, either through the basic in-built semantics of the model type(s) chosen for models C and V, or via additional “semantics enforcing” agents which are written and attached (directly or via inheritance from root or other upper concepts) to those models. For example, a feature facet could be introduced to either express (via indications provided to a human interacting with the system at a later time) or enforce the fact that any instance classified as belonging to a particular concept in the C model must also be classified to a particular concept in the V model.

[0147] CFU Operations

[0148] In the discussion of CFU operations, the following scenario will serve as an example: The user is an internal catalog designer for the online sales website of a clothing merchandiser. Using the techniques described in Software composition using graph types, graphs, and agents, supra, the clothing merchandiser's catalog has been converted into a single Ariadne model of the taxonomy type. The taxonomy model has considerable internal redundancy and a large number of products have been classified according to the model. An analyst looks at the model and decides that factoring is a good strategy. The model has a large number of instances associated with the various concepts, but no agents or constraints yet defined on the concepts beyond those that are part of the taxonomy type definition. What is desired as an output is a constellation of models that permits indexing of instances in a way that is consistent with a well-formed Ariadne model architecture. The constellation of models should capture common concepts more clearly than the original taxonomy model, should lose no information from the original taxonomy model, and should permit more ways of accessing the instances than were possible with the original taxonomy model.

[0149] Simplest Form of CFU: FIG. 14

[0150] This general operation or transform 1401 is depicted in its simplest form in the schematic shown in FIG. 14. The source model is T 1403. T 1403 is a single legacy taxonomy that has a root r 1405 and two highly similar subtrees: I₁ 1411, which is under concept a 1407, and I₂ 1413, which is under concept b 1409. The result after application of the CFU transform is composite model C 1415 and model T′ 1417. Composite model C 1415 contains a synthesized composite I₁ # I₂ of subtrees 1411 and 1413 that has been factored out from model T 1403, and model T′ 1417 is a copy of the original model T 1403 that contains the concepts of T 1403 which could not be factored out. In the following, models like T′ 1417 will be termed remainder models. As will be explained in more detail later, the user of the CFU techniques decides what is to be done with the concepts of the remainder model: whether they are to be added to the concepts in composite model C 1415, incorporated into a variability model like V model 1311, or simply discarded from the model constellation that results from application of the CFU techniques to T 1403. An important aspect of the CFU technique is that it is functional, that is, application of the technique to source model T 1403 does not change source model T 1403.

[0151] CFU that Produces a Constellation of Models Including a Variability Model: FIG. 15

[0152] Transform 1501 shows how the CFU technique may be used to transform a single taxonomy model T 1503 into a constellation 1514 of taxonomy models including a composite model C 1515, model T′ 1517, and a variability model V 1519. Transform 1301 of FIG. 13 is thus an example of transform 1501. As in FIG. 14, composite model C 1515 is a taxonomy model that combines the concepts of subtree I-1 1511 and subtree I-2 1513. Variability model V 1519 contains the concepts b 1507 and f 1509 that are the roots of subtree I-1 1511 and subtree I-2 1513. These concepts indicate why subtrees 1511 and 1513 are not a single subtree in model T 1503. The user may have to provide a concept ? 1521 that serves as a root for concepts b and f. In transform 1301, the provided concept is gender. Model T′ 1517 is the remainder model that contains the concepts of model T 1503 that remain after the removal of subtrees I-1 1511, I-2 1513, and their roots b 1507 and f 1509.

[0153] Transform 1523 shows how the CFU technique may be used to transform more than one source taxonomy model (here, models T-1 1525(1) and T-2 1525(2)) into a constellation 1544 of taxonomy models including composite model C 1545, variability model V 1549, and two remainder models 1548. The subtrees I-1 1533 and I-2 1543 whose concepts are combined in composite model 1545 now come from different source taxonomy models; similarly, the concepts in variability model V 1549 also come from different source taxonomy models. Finally, there is now a remainder model corresponding to each of the source taxonomy models; T-1′ 1547(1) corresponds to T-1 1525(1) and T-2′ 1547(2) corresponds to T-2 1525(2).

[0154] The C and V models of constellations 1514 and 1544 may be related in different ways. The simplest way in which they may be related is through the instances that belong to the concepts that are combined in the composite model. Each instance will at least have item and/or exhibitor facets that connect the instance not only to a concept in the composite model, but also to a concept in the variability model. If the C and V models are also related to each other as concept models (C) and feature models (V), there will be feature facets connecting concepts in the C models to concepts in the V models.

[0155] It should be noted here that in the current embodiments, CFU may be applied to more than two input subtrees, and that in other embodiments these subtrees may come from either subtrees of one input taxonomy model or from multiple input models in various combinations. In fact, the benefits of working with the “factored” model constellation increase with the number of input models/subtrees to which the transform is applied, since there will always be only two primary models, the C and V models, resulting as output. It should also be noted that in other embodiments, CFU may be applied to models having other than the taxonomy model type. Among these other cases are the following:

[0156] input taxonomies that allow items to be classified under multiple categories rather than enforcing a single category;

[0157] input taxonomies that allow “multiple inheritance” links such that categories may be children of more than one parent category;

[0158] models that represent part/whole or structural models of similar configurations.

[0159] The minimum requirement for application of the CFU transform to a set of source models is a facet type defined for the model type of each source model which allows a hierarchical walk through relevant concepts of the analogous models. As is apparent from this requirement, the model to which the transform is applied need not have instances. Of course, in all such cases, the relationships between the structures of the C model and the structures of the V model will depend on the types of those models, as will the rules for determining whether one structure is analogous to another.

[0160] Much of the work of the CFU transform is deciding which of the constellation of output models the concepts of the source models should be allocated to. While related information such as the names/labels associated with concepts being allocated provides some help (for example, both men's and women's shoes are called shoes), there are many situations where concepts which have different names are in fact similar (for example, cologne and perfume) and concepts that have similar names are in fact different in significant ways. It will thus in general not be possible to completely automate a transform from a given set of source models to a given constellation; as will be explained in detail in the following, a key aspect of the CFU techniques described herein is the manner in which these techniques elicit information and record decisions about allocation of concepts from the user.

[0161] Generalizations

[0162] Because of the principle of having agents and transforms be as functional as possible, the default behavior of the CFU transform is not to transform source models directly into a constellation of composite, variability, and remainder models, but to end up with both unchanged source models and the constellation resulting from the transformation. In some cases, facets (links) may need to be created in the source models during the transformation, either to keep track of state and positional information during the transformation, to make backward traceability of a transformation possible, or where the transformation has been performed iteratively, first on the source models and then on the remainder models resulting from each iteration of the transformation.

[0163] In the most general case, the input models will be the following:

[0164] I₁ through I_(m): The original m subtrees submitted to the transform.

[0165] T₁ through T_(n): The original n source models containing the subtrees. Note that m≧n.

[0166] After completion of the transform, input models are unchanged, and we will have the following constellation:

[0167] C. The composite model drawn from comparison and unification of I₁ through I_(m). Most concepts within C will be associated with one or more analogous concepts from some of I₁ through I_(m) and will have attached all instances associated with those concepts.

[0168] V. The variability model, which represents the differences in the input subtrees. V contains, at minimum, one concept for each of the m input subtrees I₁ through I_(m). This could be implemented by leaving leaf concepts in T_(i)′ (the transformed versions of the original model(s) T₁ through T_(n) as described below). However, a unified new model will be more useful in the case of multiple models as inputs, since otherwise roots of the subtrees would not appear in a single model. In any case, a primary purpose of factoring out V is to ease re-organization of single taxonomies from factored models as an output (e.g., fold taxonomy to break out Clothing prior to men's, women's, etc.).

[0169] T₁′ through T_(n)′. The resulting remainder model(s), copies of the original model(s) sans the factored-out repetitive concept structure; that is, with concepts removed that can now be generated through cross-queries of C and V.

[0170] Overview of a CFU Procedure: FIG. 17

[0171] The CFU transform may be done in any system which provides an environment for manipulating graphs as required for the transform. Techniques for doing the CFU transform in such systems are termed herein CFU procedures. At the very highest level, all CFU procedures involve the phases described below. FIG. 17 shows a high-level flowchart 1701 for a CFU procedure. Processing steps in the flowchart are related to the phases by the reference numbers in the flowchart. The flowchart will be explained in more detail below.

[0172] Initialization. This involves getting access to the source metadata in a form suitable for further processing; selecting the source trees and the input subtrees to be factored; identifying and if necessary creating the appropriate constellation of models to hold the results; and setting up parameters and defaults for the behavior of the procedure (flowchart 1705).

[0173] Making a comparison set. In this phase, the system makes a comparison set of concepts below a current concept of focus in each subtree. The first concept of focus is the root of each input subtree (flowchart 1707, 1719).

[0174] Correlation. In this phase, the system establishes correlations between concepts in the current comparison set that may be analogous. The correlations can be established using a variety of different techniques. In a sense, the result of the correlation represents the system's best theory of analogies between concepts of the current comparison set (flowchart 1721).

[0175] Validation and Elicitation. The system next solicits user input to confirm or modify the results of the correlation. In some instances this involves eliciting new semantic information about the concepts from the user (flowchart 1723).

[0176] Allocation. Once the correlations for a current comparison set, as interpreted by the user, are available, the system can allocate the concepts of the comparison set to the C, V, or T′ models of the output constellation. While doing this, the system can also set up the facets required for the proper interrelationship of the models of the output constellation to each other and to the instances belonging to the models of the output constellation (flowchart 1725).

[0177] The foregoing phases take place in the context provided by the input subtrees whose concepts are being analyzed and by the C and V models being produced by the factoring process. The phases of making a comparison set, establishing correlations, soliciting user input, and allocating the concepts of the comparison set happen at every level of the input subtrees. In the preferred embodiment, a recursive procedure factor-model traverses the input subtrees in a depth-first fashion. At each level, correlations are established for all of the sibling concepts of the level before descending to the next level. Thus, in the input trees 1303 and 1305 of model 1302, the concepts Outerwear, Footwear, Accessories, Swimwear, and Apparel of tree 1303 are correlated with the concepts from the same level of tree 1305. At the next level, the concepts belonging to Outerwear in both trees are correlated, then the concepts that are the children of Footwear in both trees, and so on, until the child concepts belonging to all of the first-level concepts have been correlated.

[0178] Continuing in more detail with the flowchart of FIG. 17, in the first part 1701 of the flowchart, block 1705 performs initialization; the user selects the source graphs and the input subtrees from those graphs and also establishes the roots for the C and V models. Block 1707 sets up the first recursion of the procedure factor-model. The procedure is invoked at 1709 with the root of C and the root of each of the subtrees, or, in transform 1301 of FIG. 13, the concepts Women's and Men's. The CFU procedure terminates when factor-model returns from its recursions.

[0179] factor-model is shown in detail at 1712. As shown at 1713, factor-model is invoked for the next level of the tree with two arguments: <current root in C> and <current concepts of focus>. <current root in C> is a concept which was added to C at the current level. This concept will be the root for the child nodes that will be added to C at the next level. In the first recursion, <current root in C> is the root of C, which the user has given the name Clothing. <current concepts of focus> are concepts in the current level of the input subtrees which have been validated by the user as analogous to the concept in C that is the <current root in C>. For the first recursion only, the <current concepts of focus> are Men's and Women's.

[0180] factor-model first tests whether the <current concepts of focus> have any child concepts. If they do not, they are leaf concepts, the bottom of a part of the input trees has been reached, and the recursion returns (1715, 1717). If the current concepts of focus do have children, the children all become members of a comparison set (1719). The members of the comparison set are then correlated to find analogous concepts (1721). One technique for correlation is matching concept names; when this technique is applied to the first level of trees 1303 and 1305, the concept names of the subtrees match exactly. The user then verifies that the concepts with matching names are in fact analogous, and refines the correlation if necessary (1723). Once the user is finished refining the correlation, factor-model uses the correlation to allocate the concepts at the level it is working on. In model 1302, the concepts Outerwear, Footwear, Accessories, Swimwear, and Apparel are allocated to the second level of C model 1309.

[0181] At the next level, factor-model must be invoked for each of the new concepts that has been added to C at this level. This iterative invocation is shown in loop 1731. When there are no more new concepts in C at this level, factor-model returns. For the selected new concept in C (1729), the selected new concept becomes <current root in C> and the concepts in the input trees which are analogous to the new concept in C that is now <current root in C> become the <current concepts of focus> (1735). Thus, if the new concept in C 1309 that is the <current root in C> is Accessories, the <current concepts of focus> are the concept Accessories in input tree 1303 and the concept Accessories in input tree 1305. Then factor-model is invoked with the new values for <current root in C> and <current concepts of focus>. In that recursion, factor-model will correlate the concepts in input tree 1303 that are dependent from the concept Accessories with the concepts in input tree 1305 that are dependent from the concept Accessories and, with the assistance of the user, validate the correlations and allocate the concepts to C, V, and T′. The concepts that are the children of each of the other concepts in the first levels of the input models 1303 and 1305 will be correlated, validated, and allocated in the same fashion. The process described above continues level by level until all of the input subtrees' concepts have been correlated, validated, and allocated. It should be noted here that the matches by which the concepts belonging to C and their relationships to each other are determined must be done level by level, but in other embodiments, they may be done breadth first instead of depth first.
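The recursion described for factor-model can be sketched in Python as follows. This is an illustrative sketch only, not the procedure of FIG. 17 itself: the `Concept` class, the `validate` callback standing in for the user interaction at 1723, and the rule that a name found in more than one input subtree counts as analogous are all assumptions made for the example. Concepts that find no match are simply collected in an `anomalous` list rather than being allocated to V or T′.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical taxonomy node; `children` plays the role of the hierarchical facet."""
    name: str
    children: list = field(default_factory=list)


def correlate_by_name(comparison_sets):
    """Group the concepts of all comparison sets by case-insensitive name (cf. 1721)."""
    groups = {}
    for siblings in comparison_sets:
        for concept in siblings:
            groups.setdefault(concept.name.lower(), []).append(concept)
    return groups


def factor_model(root_in_c, concepts_of_focus, validate, anomalous):
    """Recursive sketch of factor-model; the input subtrees are never modified."""
    comparison_sets = [c.children for c in concepts_of_focus]
    if not any(comparison_sets):          # leaf concepts: recursion returns (1715, 1717)
        return
    # Correlate (1721), then let the caller confirm or refine the groups (1723).
    groups = validate(correlate_by_name(comparison_sets))
    new_in_c = []
    for members in groups.values():       # allocation (1725)
        if len(members) > 1:              # assumed rule: found in more than one subtree
            composite = Concept(members[0].name)
            root_in_c.children.append(composite)
            new_in_c.append((composite, members))
        else:
            anomalous.extend(members)
    for composite, members in new_in_c:   # loop 1731: one recursion per new concept in C
        factor_model(composite, members, validate, anomalous)


# Usage with two small, made-up input subtrees.
men = Concept("Men's", [Concept("Footwear", [Concept("Boots"), Concept("Loafers")]),
                        Concept("Accessories", [Concept("Ties")])])
women = Concept("Women's", [Concept("Footwear", [Concept("Boots"), Concept("Pumps")]),
                            Concept("Accessories", [Concept("Scarves")])])
c_root = Concept("Clothing")
leftover = []
factor_model(c_root, [men, women], validate=lambda groups: groups, anomalous=leftover)
```

Passing an identity function as `validate` models a user who accepts every automatic correlation; a real procedure would present the groups to the user at that point.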

[0182] User Interface for Verification and Refinement: FIG. 16

[0183] As indicated in the foregoing discussion and in FIG. 17, once the system has found the best overall set of matches of concepts in the current comparison set, the user must review what the system has found. The system presents the user with its best overall set of matches and sets of anomalous concepts, that is, concepts for which no matches resulted from the present recursion and earlier recursions. The user may validate a match found by the system, may override a match found by the system, and may make matches other than those specified by the system, including matches between concepts belonging to the current comparison set, matches between concepts in that set and anomalous concepts, and matches between anomalous concepts.

[0184] FIG. 16 shows a graphical user interface 1601 for user validation and refinement. Control of the interface is by selection of elements and manipulation of buttons. At 1611, there is a list of pairs of candidate matched concepts. One member of each pair is from the current comparison set of concepts; the other member is from the commonality model C. If the user finds that a pair is not a proper match, the user selects the pair in list 1611 and clicks on split match button 1617. At that point, the system adds the concept from the current comparison set to the list 1607 of anomalous concepts from this level of I_(i) (the input subtree currently being analyzed); if the concept from C has no other match at this level, it is added to the list 1608 of anomalous concepts from C. Conversely, if the user indicates that a concept in list 1607 matches a concept in list 1608 by selecting the two concepts and clicking on join concepts button 1609, the system adds the selected pair to list of matching pairs 1611. When the user is satisfied that list of matching pairs 1611 correctly shows all of the matching pairs from the concepts from I_(i) and C being displayed in interface 1601, the user clicks on accept matches button 1613, and the matching concepts are removed from T′ and incorporated into C. Anomalous concepts in list 1608 remain in C.

[0185] To aid the user in making a decision, graphical user interface 1601 provides the user with a variety of context information. A reporting window 1605 indicates the rationale for each pairing in list 1611. When a user selects a pair in list 1611, the rationale for the pairing appears in window 1605. Rationales in a preferred embodiment include specification of the match by the user, a match based on the values of a property of the two concepts, or a match based on similarities in the facet structures of the concepts.

[0186] The user is also provided with the context of each member of a selected matched pair in the model to which it belongs. The context for I_(i) appears in window 1603 and the context for C appears in window 1621. The context in the window is fisheyed, that is, when a pair of concepts is selected in list 1611, the views in windows 1603 and 1621 change to show the concept from the selected pair, its siblings, its parent and ancestors to the root of I_(i), and perhaps its children. The concept of focus is highlighted. Windows 1603 and 1621 respond in the same fashion when an anomalous concept is selected from list 1607 or 1608. Instances windows 1615 and 1619 indicate the instances that have item facets connecting them to the concepts selected in list of pairs 1611, list of anomalous concepts 1607, or list of anomalous concepts 1608. Control of what portions of interface 1601 are displayed is by means of a command bar (not shown) in the graphical user interface. C Level Search 1623 is a window which allows the user to explore dynamically elsewhere within the C model in order to find possible matches for anomalous concepts.

[0187] Not shown in FIG. 16 is a window which permits the user to assign a name by which the concept which is represented by a matched pair will be known in model C. Naming rules for matched pairs may follow heuristics such as these:

[0188] If the concept in I_(i) is a clean match to a concept already in C, the assignment is made automatically, with reporting or confirmation based on the strength and priority settings of matching rules applied.

[0189] If the concept in I_(i) is not a clean match, the user has the option of keeping the current concept name in C, renaming with the concept name from I_(i), or providing a new name for the concept in C.

[0190] Once the preferred name is selected, the user has the option of converting the unused concept name(s) in I_(i) and/or C to Synonym properties associated with the concept in C. Obviously, only names that differ both from the preferred name and from the names already in the synonym list are worth storing as new synonyms. For example, the user could match syntactically different terms like shirts and blouses.

[0191] Users can be prompted to flag a value of a synonym property as a substring to be checked via a synonym match rule in a list of matching rules maintained by the system. For example, the user might discover Men's and Guy's concepts at a certain point and make Guy's a synonym of Men's. Adding a synonym in this fashion refines the matching process.

[0192] Ways of Correlating Concepts

[0193] The correlation phase of the CFU procedure selects candidate pairs of matching concepts.

[0194] There are a number of different techniques that can be used to determine whether one concept matches another. The CFU procedure can employ any and all of these techniques. The techniques include the following:

[0195] Textual or syntactic analysis

[0196] Hierarchical structure

[0197] Synonyms (including those dynamically generated from earlier matches)

[0198] User elicitation

[0199] Domain models

[0200] Extensional evidence (Instances)

[0201] Intensional evidence (Feature Links)

[0202] Extra-model information [e.g., agent attachments, properties, ERIS call-outs]

[0203] Textual or syntactic analysis. Perhaps the most basic way of making correlations is by similarities in names. In much of the example metadata that we have examined, there are often exact matches in names. In other cases differences are minor, involving word-stemming or case distinctions. (Matching techniques from search technologies could be applied here, although this is pairwise comparison rather than matching on one privileged search string.) We can consider these textual or syntactic techniques to be concept-to-concept matching techniques.

[0204] In addition to deriving clues about basic one-to-one correspondence among concepts, we can detect certain kinds of anomalies or other structural variations in the models by looking for constructs like additive word phrases (Men's Clothes, Men's Casual Clothes) or compound phrases (Hats, Hats and Gloves).

[0205] It is important to stress that the primary textual material being searched is the concept name-space, not arbitrary documentary text. Since concepts within taxonomies have already been named with some attempt at consistency and descriptiveness, these names can form an excellent corpus of semantically significant source material. Also, because the matching is being done within the context provided by the subtrees and the C and V models, some contextual scoping has already been applied in limiting the sets of terms on which match-testing is being performed.

[0206] Alternative Implementation: Matching Rule Checklists. One possible approach to concept-to-concept textual matching is to use a checklist of primitive matching rules. Each rule takes as input two concept names, drawn from two of the subtrees to be matched. Each rule applies a specific technique for determining whether or how well the concept names match. Results could be expressed as a Boolean or as a metric.

[0207] Some method for selecting the maximum confidence matching rule for a given pair must be specified. The behavior of the overall transform can be conditioned to a great extent by allowing a fall-through semantics for these checklists of various rules. With these semantics, once a rule is found that applies to a pair of concept names, it can be assumed to offer the strongest evidence for correlation and no further rules need to be checked. When rules return metrics rather than Boolean results, the fall-through condition could be triggered by some minimum threshold. Alternatively, sets of rules could be tested, and the resulting metrics either combined in some way or the maximal value taken. There may also be advantages to separating rules into sub-lists that trigger different behavior in terms of the interaction with the user. Some typical categories of this kind might include the following:

[0208] If the rule matches, apply automatically (and silently).

[0209] Apply and report (useful mostly for debugging purposes; otherwise simplifies to case below).

[0210] Apply and report for confirmation (as in current interface; user must take action to undo the match).

[0211] Do not apply but report as suggestion (for very low confidence rules).

[0212] Ignore the rule (allows rules to stay in the repertoire but to be easily de-activated).

[0213] In the latter two cases, the rules start to take on the character of explanation aids.
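A checklist with fall-through semantics and per-rule interaction categories might be sketched as follows. The three rules, their scores, and their thresholds are hypothetical placeholders chosen for illustration; only the five interaction categories and the idea that the first rule to meet its threshold ends the search come from the description above.

```python
from enum import Enum


class Action(Enum):
    """Interaction categories a rule can trigger when it applies."""
    APPLY_SILENTLY = "apply automatically and silently"
    APPLY_AND_REPORT = "apply and report"
    APPLY_AND_CONFIRM = "apply and report for confirmation"
    SUGGEST_ONLY = "do not apply but report as suggestion"
    IGNORE = "ignore the rule"


# Hypothetical rules: each takes two concept names and returns a metric in [0, 1].
def exact(a, b):
    return 1.0 if a == b else 0.0


def case_insensitive(a, b):
    return 0.9 if a.lower() == b.lower() else 0.0


def shared_first_word(a, b):
    words_a, words_b = a.lower().split(), b.lower().split()
    return 0.3 if words_a and words_b and words_a[0] == words_b[0] else 0.0


# Ordered from strongest to weakest evidence: (name, rule, minimum threshold, action).
CHECKLIST = [
    ("exact", exact, 1.0, Action.APPLY_SILENTLY),
    ("case-insensitive", case_insensitive, 0.8, Action.APPLY_AND_CONFIRM),
    ("shared-first-word", shared_first_word, 0.2, Action.SUGGEST_ONLY),
]


def first_applicable(a, b, checklist=CHECKLIST):
    """Fall-through: return the first rule whose metric meets its threshold, else None."""
    for name, rule, threshold, action in checklist:
        if action is Action.IGNORE:       # de-activated rules stay in the repertoire
            continue
        score = rule(a, b)
        if score >= threshold:
            return name, score, action
    return None
```

Switching a rule's action to `Action.IGNORE` de-activates it without removing it from the list, which is the purpose of the last category.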

[0214] Starting Set of Matching Rules. Here we suggest a starter set of simple rules which can be applied without requiring call-outs to sophisticated natural language processing:

[0215] First, to find exact correlations:

[0216] An exact match of the text strings triggers a strongly probable match.

[0217] Depending on the original metadata import scheme and restrictions on name uniqueness imposed by the supporting modeling system, there may be conventions for uniquifying (or de-redundizing) names upon input (e.g., clothing and clothing1). If these conventions are known to the matching procedure, they can be reversed in order to match these strings with a high degree of confidence.

[0218] The sum of squares of matching substrings metric used in a prototype turns out to provide a relatively robust extension of plain text matching.

[0219] Certain syntactic transformations such as plurals can be matched as almost exact.

[0220] Synonym lists associated by concept. For example, the concept Men could have the synonym Guys added, either upon initial creation or as a result of previous matches. The text match used could then be the best score

[0221] We can also count on some characteristics of category names such as clustering multiple names under a single category name (Hats and Gloves, for example). These are also significant syntactic clues for subset relations.

[0222] AND matches several syntactic connectors (and, &, “,” etc.)

[0223] A: (X AND Y) matches with B: (Y AND X); extended for multiple term lists. This matching rule assumes that while ordering of siblings may be arguably of semantic import in the input hierarchies themselves, ordering of sub-terms within a clustered concept name can be ignored during the match.
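Several of the starter rules above can be sketched with the Python standard library. Everything below is an assumption made for illustration: the text does not define the sum-of-squares metric, so the version here (squared lengths of the matching blocks found by `difflib.SequenceMatcher`, normalized by the squared length of the longer name) is only one plausible reading; the trailing-digit uniquifier convention, the naive plural stripping, and the scores 0.95 and 0.9 are likewise hypothetical. The synonym rule is omitted.

```python
import re
from difflib import SequenceMatcher


def strip_uniquifier(name):
    """Reverse an assumed import convention that appends digits (clothing1 -> clothing)."""
    return re.sub(r"\d+$", "", name)


def singular(term):
    """Very naive plural stripping; a real system would use a proper stemmer."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def cluster_terms(name):
    """Split a clustered name on and / & / comma and ignore the order of sub-terms."""
    parts = re.split(r"\s*(?:\band\b|&|,)\s*", name)
    return frozenset(singular(part) for part in parts if part)


def sum_of_squares(a, b):
    """Assumed reading of the metric: squared matching-block sizes, normalized to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    return sum(block.size ** 2 for block in blocks) / longest ** 2


def match_score(a, b):
    """Try the rules from strongest to weakest and return the first that applies."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:                                        # exact match
        return 1.0
    if strip_uniquifier(a) == strip_uniquifier(b):    # reversed uniquifier
        return 0.95
    if cluster_terms(a) == cluster_terms(b):          # plurals and reordered clusters
        return 0.9
    return sum_of_squares(a, b)                       # fallback substring metric
```

Squaring the block sizes rewards one long shared substring more than several short ones of the same total length, which may be why such a metric behaves as a robust extension of plain text matching.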

[0224] Adaptive Weighting of Matching Rules. Finding consistent patterns of differentiation across models or subtrees might allow particular matching rules to be applied with more confidence. For example, one modeler may have used plural names, another singular names, so a plural/single unifier rule might be exercised repeatedly in setting up correlations between those two models. An implementation can exploit this by dynamically adjusting the weighting and/or ordering of rules to be applied, based on initial weighting and number of times the rule was applied to an accepted match within the current subtrees.
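One way such adaptive weighting might look is sketched below. The class name, the linear boost formula, and the example weights are assumptions; the text specifies only that the adjustment depends on the initial weighting and on how often a rule was applied to an accepted match.

```python
from collections import Counter


class AdaptiveRuleWeights:
    """Hypothetical bookkeeping for rule weights within the current subtrees."""

    def __init__(self, initial_weights, boost=0.1):
        self.initial = dict(initial_weights)   # rule name -> initial weight
        self.boost = boost                     # assumed gain per accepted match
        self.accepted = Counter()              # rule name -> accepted-match count

    def record_accepted(self, rule_name):
        """Call when the user accepts a match that this rule proposed."""
        self.accepted[rule_name] += 1

    def weight(self, rule_name):
        """Initial weight scaled up linearly with the number of accepted matches."""
        return self.initial[rule_name] * (1 + self.boost * self.accepted[rule_name])

    def ordered_rules(self):
        """Rule names ordered from highest to lowest current weight."""
        return sorted(self.initial, key=self.weight, reverse=True)


# Usage: a plural/singular rule gains priority after repeated accepted matches.
weights = AdaptiveRuleWeights({"exact": 1.0, "synonym": 0.6, "plural": 0.5})
order_before = weights.ordered_rules()
for _ in range(5):
    weights.record_accepted("plural")
order_after = weights.ordered_rules()
```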

[0225] Hierarchical structure. Because the matching of concepts occurs within the context provided by the models, the structural and positional information that the model provides about a concept can be brought to bear in deciding whether two concepts match. For example, when potentially matching concepts appear as members of sets of siblings within structurally analogous hierarchies, corresponding siblings probably represent matching concepts. Thus, if we had Man, Woman, Child as one set of siblings and Man, Woman, Kid as another, we have some evidence for correlating Child and Kid that derives solely from the structure and positions of the concepts in the tree. A variety of structural factors can be used in weighing the degree of confidence of a match in the context of two sibling sets:

[0226] The relative confidence level of the syntactic comparison function (e.g., perfect string match vs. substring match).

[0227] The closeness in cardinality of the two sibling sets (e.g., a set of 4 concepts matches more confidently to a set of 4 than to a set of 6).

[0228] The number and weight of the matches among the other concepts in the sibling sets (i.e., if all concepts match extremely well except one, this is stronger evidence for the match, despite the absence of textual cues).

[0229] The relative positions of the concepts being matched in the set.

[0230] The number of matches across all the sibling sets in the comparison set.

[0231] Additional structural information. Since concepts live within metadata structures, choices made at one level determine both the action of the CFU procedure and quite possibly results at lower levels. If there are subtrees below Kid and Child, we must accept the correlation between them to recursively start conceptual factoring on their two subtrees. If the match of Kid with Child was correct, the subtrees should be highly consistent; if they are not, the match was probably not correct. Moreover, the further down the hierarchy we go, the more certain we should become of the quality of the match.

[0232] Alternative Implementation. The following technique is a preferred implementation for structurally weighting concept-to-concept matches in parallel within the context of the sibling sets:

[0233] Suppose we have two sets of sibling concepts, c^(A)₁ through c^(A)_(i) and c^(B)₁ through c^(B)_(j). Then, without loss of generality, choose i to be the index of the smaller sibling set, so that i ≤ j.

[0234] Form an i×j matrix (note that by construction there are equal or fewer rows than columns). The best matching score for each c^(A)/c^(B) combination (as derived from applying some concept-to-concept matching approach like that described in the previous subsection) will be stored in the cells of this matrix.

[0235] Various weighting schemes can now be applied to the matrix, based on the aspect ratio of i to j (similar cardinality sets of siblings are more likely to be analogous), and positional matches (a match in the first term of each sibling list should count for more than a match between the first and third terms), etc. This can be done by a weight favoring the major diagonal (position 1,1 through i,j) and proximity thereto.

[0236] Once weight-adjusted scores for each combination are calculated, an overall set of correlations for the sets of siblings must be chosen. The current CFU prototype's algorithm uses a linear programming technique which prefers matches which are clearly better than the next best match.

[0237] Metrics for individual pairwise comparison can be defined independently of the positional adjustment matrix weighting scheme; the latter can be independent in turn from the linear programming or other rules which help select the optimal overall set of pairwise associations (with left-over anomalous concepts) for respective sets of siblings.
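The weighting technique described above can be illustrated with the following minimal Python sketch. It is only an illustration under stated assumptions: the function names, the particular cardinality and diagonal weights, and the pairwise scoring function passed in as `base_score` are hypothetical, and the greedy selection step is a simple stand-in for the linear programming technique, which is not reproduced here.

```python
from typing import Callable, List, Tuple


def weighted_matrix(a: List[str], b: List[str],
                    base_score: Callable[[str, str], float]) -> List[List[float]]:
    """Build the i x j matrix of weight-adjusted scores (len(a) <= len(b))."""
    if len(a) > len(b):
        raise ValueError("pass the smaller sibling set first")
    i, j = len(a), len(b)
    if i == 0:
        return []
    # Assumed cardinality weight: 1.0 when both sibling sets have the same size.
    cardinality = i / j
    matrix = []
    for r, concept_a in enumerate(a):
        row = []
        for c, concept_b in enumerate(b):
            # Assumed positional weight: largest on the major diagonal
            # (1,1 through i,j), falling off with relative distance from it.
            rel_r = r / max(i - 1, 1)
            rel_c = c / max(j - 1, 1)
            positional = 1.0 - 0.5 * abs(rel_r - rel_c)
            row.append(base_score(concept_a, concept_b) * cardinality * positional)
        matrix.append(row)
    return matrix


def pick_pairs(a: List[str], b: List[str],
               matrix: List[List[float]]) -> List[Tuple[str, str]]:
    """Greedy stand-in for the selection step: take the highest scores first."""
    cells = sorted(((matrix[r][c], r, c)
                    for r in range(len(a)) for c in range(len(b))), reverse=True)
    used_rows, used_cols, pairs = set(), set(), []
    for score, r, c in cells:
        if score > 0 and r not in used_rows and c not in used_cols:
            pairs.append((a[r], b[c]))
            used_rows.add(r)
            used_cols.add(c)
    return pairs
```

With the sibling sets Man, Woman, Child and Man, Woman, Kid and a base score that favors exact name matches, the positional weight makes the diagonal cell for Child and Kid the strongest remaining candidate once the exact matches are taken, which is the structural evidence discussed above.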

[0238] A Prototype Implementation of a CFU Procedure: FIGS. 18-21, 24

[0239] A prototype CFU procedure has been implemented and used on a number of real Ariadne models. The prototype includes an algorithm for matching sets of concepts to one another, and for asking the users for guidance when this match is not sufficient to complete the factoring. It has been used successfully to factor models as follows:

[0240] Factoring out repeated categories in an LLBean Web site index,

[0241] Factoring out repeated categories in a mock e-commerce portal that was created by us, using connections to four clothing retailers,

[0242] Factoring out the common structure in a marketing document, where each page of the document describes a different company.

[0243] Overview of a Prototype CFU System: FIG. 18

[0244] FIG. 18 shows an overview of a prototype CFU system 1801 that is implemented in Ariadne system 1803. Prototype 1801 is being used in FIG. 18 to factor Clothing model 2401 of FIG. 24 into a constellation 2403 consisting of C model 2405 and V model 2407. The instances 1810 for the Clothing, V, and C models are contained in world 1809 and are related to the models by item facets 1815. In terms of the CFU transform, Clothing model 2401 is the T model, with the subtree of the concept Women's 2409 and the subtree of the concept Men's 2411 being employed as input subtrees I₁ and I₂ respectively. All of models 2401, 2405, and 2407 are models of the Taxonomy type, and are thus associated with taxonomy model type 1805, as indicated by the dashed arrows. Also associated with taxonomy model type 1805 is factor_models agent 1807, which performs the CFU transform on input models of the taxonomy type. While doing the transform, agent 1807 maintains a matched pairs list 1811 and an anomaly list 1813. Agent 1807 uses these lists to produce a graphical user interface for receiving user input concerning concept matches. The GUI is a simplified version of the GUI of FIG. 16 and will be shown in more detail below.

[0245] Details of Factor_Models Agent 1807: FIG. 19

[0246] factor_models is an Ariadne invocation agent; this means that it is invoked through the Ariadne invocation sequence. The invocation sequence is the following:

[0247] First, the user selects the agent factor_models from a bin 605 (FIG. 6) of Agents;

[0248] Next, the user selects a number of concepts as roots of the subtrees that are to be factored from Clothing model 2401;

[0249] Then, the user selects Invoke with the right mouse button to call factor_models on the selected concepts;

[0250] Finally, the agent prompts the user for the names of the models that will be C and V; the name of C is prompted with the query “What do these things have in common?” while the name of V is prompted with the query “How do these things differ?”

[0251] As is clear from the foregoing, factor_models only supports a single input model T, but it may be used to unify and factor any number of subtrees in T.

[0252] FIG. 19 is a flowchart of factor_models. The algorithm is described in detail in the following.

[0253] 1) factor-models is attached to a model of taxonomy type and is invoked with a set N of the concepts that are the roots of the input subtrees I₁ through I_(m) (1903 in flowchart 1901).

[0254] 2) The user responds to requests for the names of the C and V models (1905).

[0255] 3) The procedure makes the models and their roots (1907).

[0256] 4) The procedure calls the recursive procedure factor-models-fn(<root of C>, N) (1909). When factor-models-fn returns, the algorithm terminates (1911).

[0257] FIG. 20 is a detailed flowchart of factor-models-fn procedure 2001. The algorithm for the function follows. The procedure is invoked with a current root concept cr in C and the set N of concepts whose subtrees are being unified (2003):

[0258] 1) Get the set of child concepts n for each of the concepts c of N; if there are no child concepts for any c, the recursion is done; return (2004, 2005, 2006).

[0259] 2) Call a function find_common_sets with the set of sets of child concepts; this function does the correlation, user validation, and allocation (2007); it returns a set R of names of concepts to be rooted in cr.

[0260] 3) The concepts corresponding to the names in R are created and added to C in 2011.

[0261] 4) factor-models-fn is invoked for the next level of recursion at 2013; there is an invocation for each new concept cr added to C in 2011; in each invocation, N is the concepts in the subtrees that were correlated to cr in P.
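The recursion in steps 1) through 4) can be sketched in Python as follows. This is a minimal sketch, not the prototype's code: the `Concept` class, the dictionary returned by the matcher (new concept name mapped to the source concepts correlated with it), and the name-equality matcher `match_by_name` are all assumptions introduced for illustration, and user validation is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Concept:
    name: str
    children: List["Concept"] = field(default_factory=list)


# Hypothetical matcher signature: takes one child list per input subtree and
# returns {name of new concept in C: source concepts correlated with it}.
Matcher = Callable[[List[List[Concept]]], Dict[str, List[Concept]]]


def factor_models_fn(cr: Concept, n: List[Concept], find_common_sets: Matcher) -> None:
    """Grow composite model C below `cr` from the subtrees rooted at `n`."""
    child_sets = [c.children for c in n]
    if not any(child_sets):          # step 1: no children anywhere, recursion ends
        return
    correlated = find_common_sets(child_sets)       # step 2
    for name, sources in correlated.items():
        new_cr = Concept(name)                      # step 3: create concept in C
        cr.children.append(new_cr)
        factor_models_fn(new_cr, sources, find_common_sets)  # step 4


def match_by_name(child_sets: List[List[Concept]]) -> Dict[str, List[Concept]]:
    """Stand-in matcher (assumption): correlate children with identical names."""
    groups: Dict[str, List[Concept]] = {}
    for children in child_sets:
        for child in children:
            groups.setdefault(child.name, []).append(child)
    return groups
```

Passing the matcher as a parameter keeps the tree walk separate from the correlation policy, so a string-similarity matcher or a user-validated matcher could be substituted without changing the recursion.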

[0262] Continuing with find-common-sets 2017, as shown at 2018, the procedure is invoked with a set S of sets of concepts.

[0263] 1) At 2019, the procedure is initialized; the smallest set of concepts s_(i) is assigned to the canonical set of concept names C and is removed from S; the list P of matching pairs is initialized so that the matching pairs are all names from s_(i).

[0264] 2) Execute loop 2033 until there are no more sets of concepts in S (2021) and return R (2023). In the loop,

[0265] a) set x to the current set s_(j) from S and remove s_(j) from S (2025);

[0266] b) find the best match between each concept s in x and a concept d in the canonical set C; save the best match (d,s) in P; if there is no match, save s in the anomaly list A;

[0267] c) receive input from the user verifying and/or changing assignments of s's to pairs (d,s) in P or to A;

[0268] d) Pairs consisting of unmatched concepts from A are unioned to P; the concepts in A are unioned to S.
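The loop above can be sketched in Python as follows. This is a hedged illustration only: concepts are represented by their names, the `score` function and the optional `review` callback (standing in for the user's verification in step c) are hypothetical parameters, each concept simply takes its highest-scoring canonical partner, and anomalous concepts are added to the canonical set, which is this sketch's reading of step d) in light of the prototype's treatment of anomalies as new categories in the composite model.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]
Review = Callable[[List[Pair], List[str]], Tuple[List[Pair], List[str]]]


def find_common_sets(sets: List[List[str]],
                     score: Callable[[str, str], float],
                     review: Optional[Review] = None) -> Tuple[List[str], List[Pair]]:
    """Return (canonical names, matched pairs) for a list of sibling sets."""
    if not sets:
        return [], []
    s = sorted(sets, key=len)               # smallest set first
    canonical = list(s.pop(0))              # step 1: initial canonical set
    pairs: List[Pair] = [(name, name) for name in canonical]
    while s:                                # step 2: loop until S is empty
        x = s.pop(0)                        # step a
        matched: List[Pair] = []
        anomalies: List[str] = []
        for concept in x:                   # step b: best match or anomaly
            best = max(canonical, key=lambda d: score(d, concept), default=None)
            if best is not None and score(best, concept) > 0:
                matched.append((best, concept))
            else:
                anomalies.append(concept)
        if review is not None:              # step c: user may reassign
            matched, anomalies = review(matched, anomalies)
        pairs.extend(matched)
        for concept in anomalies:           # step d (assumed interpretation)
            pairs.append((concept, concept))
            canonical.append(concept)
    return canonical, pairs
```

For example, with the sets [Shirts, Socks, Skirts] and [Shirts, Socks] and an exact-name score, the smaller set becomes the canonical set and Skirts ends up as an anomalous concept appended to it.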

[0269] Matching Concepts in Factor-Models-fn

[0270] The matching algorithm used in factor-models-fn is based on a simple similarity metric between strings. The metric is given by the formula:

$$\mathrm{dist}(s_1, s_2) = \sum_{c} \bigl(l(c)\bigr)^{2}$$

[0271] where c ranges over all the substrings that are common between s₁ and s₂, and l(s) is the length of the string s. This matching algorithm favors matches that are unambiguous, that is, where the best match is clearly better than the second-best match.
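One way to compute such a metric is sketched below in Python. The text does not say exactly which common substrings are counted, so the sketch assumes that every distinct substring occurring in both strings contributes once, and that the comparison ignores letter case; both choices are assumptions made for illustration.

```python
def common_substrings(s1: str, s2: str) -> set:
    """All distinct non-empty substrings of s1 that also occur in s2."""
    subs = {s1[i:j] for i in range(len(s1)) for j in range(i + 1, len(s1) + 1)}
    return {sub for sub in subs if sub in s2}


def dist(s1: str, s2: str) -> int:
    """Sum of squared lengths of the common substrings (case-insensitive)."""
    a, b = s1.lower(), s2.lower()
    return sum(len(c) ** 2 for c in common_substrings(a, b))
```

Under these assumptions dist("Bags", "Handbags") is 50, because all ten substrings of "bags" occur in "handbags", while dist("Pants", "Skirts and Dresses") is 12 and dist("Pants", "Slacks") is 2. Squaring the lengths makes one long shared substring count for far more than several short ones.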

[0272] The algorithm is implemented in factor-models-fn as follows:

[0273] 1) For the current d∈S, find the c∈C that maximizes dist(d,c). Find also the “runner up”, that is, the d′∈S, d′≠d, that maximizes dist(d′,c) for the remainder of C.

[0274] 2) Calculate the margin of the best matching concept of C, b(c) = dist(d,c) − dist(d′,c).

[0275] 3) Select d∈S such that b(c) is maximized for concept c of C.

[0276] 4) Add the pair (d, c) to the pair list P.
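Steps 1) through 4) can be sketched in Python as follows. The sketch assumes that the four steps are repeated, removing each selected pair, until one of the two sets is exhausted; that repetition, the function name `margin_match`, and the score values in the usage note are assumptions for illustration rather than details given in the text.

```python
from typing import Callable, List, Tuple


def margin_match(candidates: List[str], canonical: List[str],
                 score: Callable[[str, str], float]) -> List[Tuple[str, str]]:
    """Pair candidates with canonical concepts, most unambiguous match first."""
    candidates, canonical = list(candidates), list(canonical)
    pairs = []
    while candidates and canonical:
        best = None  # (margin b(c), d, c)
        for d in candidates:
            # Step 1: best canonical partner c for d, and the runner-up
            # candidate d' for that same c.
            c = max(canonical, key=lambda x: score(d, x))
            runner_up = max((score(d2, c) for d2 in candidates if d2 != d), default=0)
            # Step 2: margin between the best match and the runner-up.
            margin = score(d, c) - runner_up
            # Step 3: keep the d whose margin is largest.
            if best is None or margin > best[0]:
                best = (margin, d, c)
        _, d, c = best
        pairs.append((d, c))      # Step 4: add the pair to P
        candidates.remove(d)
        canonical.remove(c)
    return pairs
```

With hypothetical scores in which Belts/Belts and Handbags/Bags score highly and Perfume/Cologne scores only 1, the clear matches are committed first, and Perfume is then paired with Cologne largely because they are the two concepts left over.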

[0277] An Example of the Operation of Factor_Models: FIGS. 21-24

[0278] Operation of factor_models is demonstrated with Clothing model 2401 of FIG. 24 as the source model. The user selects Women's and Men's as the leaf concepts in V. The subtrees of concepts whose roots are Women's and Men's are consequently I₁ and I₂ respectively. The concepts in I₁ and I₂ will be examined for matches beginning with the children of Women's and Men's; then the children of matching concepts will be examined for matches, and so on, until all levels of I₁ and I₂ below the root concepts have been examined for matches.

[0279] Since both Men's and Women's have the same number of child concepts, the child concepts of either can be chosen as the initial canonical set C; in this case, the child concepts of Men's are chosen to start. The subclasses of the initial C are then matched against the child concepts of Women's, which are what remains in S after the child concepts of Men's are removed to make the initial canonical set C.

[0280] The result of the match is shown in window 2101 of FIG. 21. Window 2101 is the window in the prototype that corresponds to interface 1601 in FIG. 16. The prototype window includes only subwindows for displaying lists corresponding to lists 1607, 1608, and 1611. At 2105 are displayed any anomalous concepts from the current S; at 2107 are displayed any anomalous concepts from the current canonical set C. In window 2119 is displayed the current list 2113 of matching concepts P. In this case, all of the concepts of S and C match, so P includes them all and they all appear in list 2113, while windows 2105 and 2107 are empty. The user can accept the matches by clicking on OK button 2115. If the user is not satisfied with the matches, the user can select a matched pair and use split button 2111 to split it; the members of the split pair will appear in windows 2105 and 2107. Concepts in windows 2105 and 2107 may be selected and joined as a matched pair in 2113 by using join button 2109.

[0281] Once the user clicks on OK button 2115, concepts corresponding to the names in the canonical set C are allocated to composite model C 2405. factor_models then proceeds to the next level. The immediate children of the concepts in each of the matched pairs of the first level are matched; thus for the pair Apparel (from Women's) and Apparel1 (from Men's), the concepts Slacks, Vests1, Socks, Shorts, Shirts, Sweaters, Sleepwear, Skirts and Dresses, and Swimwear from Women's are matched against Pants1, Vests3, Socks1, Shorts1, Shirts1, Sweaters1, Sleepwear1, and Swimwear from Men's. Here, the matching is not so easy. There is no concept in Men's that matches Slacks or Skirts and Dresses.

[0282] The window at 2118 shows what happens in such a case. When the algorithm reaches Accessories <−−> Accessories1, there is no exact match for the concepts Handbags and Perfume from Women's or Bags and Cologne from Men's. The matching algorithm does pick up the similarity between Bags and Handbags and lists them as a matched pair at 2125 in matched pair list 2119; it also correctly matches Perfume and Cologne, simply because they are the two that are left after the other matches are found. The correct matching of Bags and Handbags and Cologne and Perfume is a good example of how the effectiveness of the matching algorithm is increased by the fact that it is applied within the structure provided by the models.

[0283] When the algorithm reaches Apparel <−−> Apparel1, we see a situation where input from the user is necessary to get the semantically correct result. Here, the children with no matches are Slacks and Skirts and Dresses in Apparel and Pants in Apparel1. The matching algorithm can do nothing at all with Slacks and matches Skirts and Dresses with Pants on the basis of the match between the “an” in “and” and the “an” in “Pants”. Window 2201 in FIG. 22 shows how all of this appears in the user interface. Subwindow 2107 now contains the unmatched candidate concept Slacks 2203. Matched pair list 2205 contains the erroneously matched pair Pants<−−>Skirts and Dresses at 2207.

[0284] Window 2208 in FIG. 22 and window 2301 in FIG. 23 show how the user can use the interface to deal with this situation. In window 2208, the user has selected matched pair 2207 and pressed Split button 2111; as a result, pair 2207 is removed from list 2213 and Pants appears at 2209 in subwindow 2105, while Skirts and Dresses appears in addition to Slacks 2203 in subwindow 2107. The user of course recognizes the close semantic relationship between Pants 2209 and Slacks 2203, so the user selects these concepts and then clicks on Join button 2109. The result is shown in screen 2301. The new matched pair Pants<−−>Slacks 2307 has been added to list 2305 and Skirts and Dresses 2303 remains in window 2107 as an anomalous term. As indicated above, the prototype adds Skirts and Dresses 2303 to the composite model C; in other embodiments, the user may be asked whether to assign the term to the list of anomalies A, the variability model V, the composite model C, or leave it in the remainder model.

[0285] FIG. 24 shows the resulting model constellation 2403, with composite Clothing model 2405 and variability model 2407. The names in model 2405 are simply the names from the canonical set. If changes are deemed necessary, they are made using the standard Ariadne name changing capabilities. In other embodiments, the user may have the option of specifying names for matched pairs.

[0286] As simple as the prototype embodiment is, it shows the power of the CFU techniques disclosed herein.

[0287] The algorithm shows the power of a mixed-initiative approach to factoring. The algorithm can walk down the trees in parallel, presenting the user with its best guess at each level as to how the matches should be made. The user can interrupt at any time to correct these matches; then the walk continues based upon these corrections. In fact, the algorithm can be tuned to be more intrusive (ask the user for confirmation of every match), or less intrusive (only ask the user when there are leftover concepts, or some of the concepts match particularly poorly), as desired.

[0288] The algorithm shows how the context provided by the model can be used to leverage the power of even a simple string match. In the example of Cologne <−−> Perfume, the algorithm finds a match between two analogous concepts, based not on surface similarity between their labels, nor upon a large and comprehensive thesaurus, nor even upon syntactic or morphological analysis of their labels, but simply on the context of the match, in which all the other items had reasonable superficial matches. In a situation in which the models being unified are very similar (as is often the case in web catalogs), even such simple context-based matching can have a very powerful effect.

[0289] The algorithm automatically classifies any items from the original trees into the appropriate categories in the Commonality and Variability models (this capability is not shown in these examples). This means, among other things, that further factoring can be done on either of these models as appropriate.

[0290] As limited as the prototype is, it is useful for processing real-world web indices to create a more flexible multi-dimensional index. In many cases, issues about merging categories (i.e., matching two categories in one tree to a single category in another), level changing (i.e., matching a category at one level of one tree to a category at another level of another tree) and category naming can be managed by pre- or post-processing the trees. For example, a failed match between Insulated Vests and Vests can be repaired by inserting a new concept Vests in one tree before running the algorithm. While, strictly speaking, this violates the functional nature of the transform, it does give the user considerable control over the decisions being made about how to interpret the tree, and allows even a simple algorithm to process real-world data.

[0291] Other implementations can be made that do not have the limitations of the prototype. Among the limitations are:

[0292] This algorithm always treats an anomalous concept as a new category in the Commonality model; there is no provision for adding concepts to the Variability model or leaving them in a neutral Remainder model.

[0293] This algorithm provides no support for matching more than one item from one tree to a single item from another. So for example, it could not take any appropriate action should the canonical set include a concept Hats and Gloves and the set of concepts being matched to the canonical set include the separate concepts Hats and Gloves.

[0294] This algorithm looks for matches at a single level only; hence it cannot detect or treat any changes of levels. For example, if a first tree had a concept Vests with subconcepts Insulated and Uninsulated, the algorithm would not make a sensible match between the first tree and a second tree, where the intermediate concept Vests is omitted, and two categories called Insulated Vests and Uninsulated Vests appear at the same level as Vests in the first tree.

[0295] This algorithm does not take advantage of syntactic clues in matching concept names, e.g., the word “and” in “Hats and Gloves”.

[0296] The algorithm provides no option for the user to decide what name to use for a new concept in C. In the example above, the name Cologne was used in the final output; this has to do with the order in which the algorithm encountered the names. In another run, the concept might receive the name Perfume. In no case will the algorithm allow for the introduction of a new name (e.g., Fragrance) as part of the factoring process.

[0297] The prototype does not include any capabilities for changing the level in C at which a concept is represented.

[0298] The prototype provides no capabilities for retracting bindings between concepts in the source model and concepts in the model constellation once they have been committed.

[0299] Procedure for Dealing with Subset Relations

[0300] In our discussion so far, we have presumed that it would be possible to match concepts in one branch of a taxonomy with concepts in another branch, as if the intent of the two concepts were comparable sets. We now consider the possibility that one of the items to be matched refers to a subset of the items referred to by another item. We describe a detailed procedure for walking through the models checking for these relations.

[0301] Assume we have checked for direct correlations (matches) as in the descriptions above. Also assume possible tuning of the matching algorithms to anticipate subset relations. For each concept in I_(i) we are basically going to decide one of the following paths:

[0302] The concept matches some sibling at the current level in C.

[0303] The concept is a super-category of one (or more) concepts already in C.

[0304] The concept is a sub-category of one of the concepts already in C.

[0305] The concept is an overlap with one (or more) of the concepts already in C.

[0306] Various hybrids of the super-category, sub-category or overlapping cases above.

[0307] Last but not least, we can decide the concept is truly anomalous at this point in the structure, and determine whether to allocate it to C, V or the remainder model.

[0308] Detailed descriptions of how to recognize each of these cases and how to process them are provided below.

[0309] Concept I is a Superset/Super-Category of Parent Concept C

[0310] In this case we introduce the concept as a sub-category of the current parent concept in C and make the other concepts children of the introduced concept.

[0311] Note that we don't need to compare I to the current parent in C, and certainly do not need to look for matches at any point higher than that parent. We are positioned here based on the “working theory” that our parent in I_(i) matched with the parent concept in C. Even if there were a syntactic match, the current model state is effectively declaring that there is a semantic distinction to be respected. An apparent syntactic match with the parent, or one of its siblings (uncles) or their children (cousins) would conflict with the semantic assumptions of the taxonomic model type and our current position in the walk. We do not want to overrule this claim on the basis of matching criteria. So the concept in question cannot be more than a subset of the parent concept in C. This property of the walk has the virtue of continually driving us downward in the hierarchy of C.

[0312] Checking other siblings of C. There are, however, some special cases and issues to be considered. The concept in I_(i) could be a super-category of more than one sibling concept in C. In this case it is important to continue checking all other siblings in C, to move any appropriate siblings under the newly introduced concept.

[0313] Interpreting superset as splitting vs. flattening. The superset relation could correspond to two different patterns: splitting or flattening. Suppose a is the concept in I which “splits” to two (or more) concepts b1, b2 in C. If concept a in I has children that correlate to b1, b2, then the pattern suggests flattening. If a does not have such children, then in effect a has been “replaced” with b1, b2 in C and the pattern suggests splitting. In either case, since the basic superset relation holds between a and b1, we need to check all siblings at that level in C before doing carryover.

[0314] Concept I is a Subset/Sub-Category of Parent Concept C

[0315] Let a be the concept in I_(i) and b the concept in C. Matching heuristics suggest a is a subset/sub-category of b. In this case, we make a recursive call to re-invoke the correlation routine to match a against the current children of b (the presumed super-category within C). At this next level down we will make the same checks again: match directly, introduce as a parent, recurse down below some presumed super-category, or add as an anomalous “new concept,” to be allocated either to C or V at that point in processing. Once a has finally been dealt with, we must “pop the stack” and resume the walk at the next sibling in I_(i).

[0316] Interpreting Subset as Merging or as Deepening. Suppose a1, a2 are both concepts in I which merge to b in C. We can handle a1 and a2 in separate passes of the iteration, since each independently will get driven down as sub-categories below b with a recursive invocation. Note that the subset relation could correspond to two possible patterns: merging or deepening. If b has no children that correspond to a1/a2, the scenario suggests a merge pattern. If b has children that correspond, this is more suggestive of a deepening operation; we will discover this each time we drive a1 (respectively, a2) down for the recursive match.

[0317] Overlapping

[0318] In this case, our actions are a blend of what we do for superset and subset: we introduce a new “placeholder” concept, and place our concept along with the overlap concept as children of that new concept.

[0319] As in the superset transformation, we need to check other siblings in C. If a in I overlaps b1 in C, it can also overlap b2 in C. If a overlaps b1, b2, then we need to create a new concept in C for {a, b1, b2}, with a added as a child of that concept and b1 and b2 both “deepened” to be children under the concept. If we assume b1 and b2 are disjoint, then this structure will work.

[0320] Note that overlapping is a symmetrical relation between models I and C, so we don't need two cases to consider. However, because of this fact, and because of the similarities in overlap vs. superset, this one case forces us to consider all siblings in I before we commit to the new structure; if we don't do this we will need lots of extra flags, etc., and more special-case processing later on. Thus an additional complexity is that we need to check other siblings in I_(i) for overlap as well.

[0321] In fact, in the case of overlap we may have an arbitrary number of concepts in the sibling sets of I_(i) and C that overlap in a “chained” manner. The outcome we would want would be a single “synthetic” concept which represents the union of all these connected overlapping concepts. Assuming we adopt a processing approach that results in separate synthetic unions of this sort being processed at different times during the correlation phase, we follow a rule that a synthetic category be merged with any synthetic category already in place.

[0322] Furthermore, we adopt a rule which says that, when matching siblings in a set, we do not attempt to match synthetic concepts but instead recurse down and match their children. (Since we do not generate successive synthetic concepts there will be at most one level of indirection here.)

[0323] By adopting these conventions for handling the “synthetic category” introduced for overlapping concepts, we solve the problem of having to process all the siblings in I in parallel. We can now proceed sibling by sibling, with the confidence that each will be checked appropriately.

[0324] Hybrids

[0325] We can also run into various hybrid situations, but not unlimited ones (otherwise they would violate the assumed semantics of the models). For example, a concept a in I_(i) might have an overlap with concept b1 in C, and a subset or superset relation with another concept b2 in C.

[0326] Suppose a concept a in I overlaps b1 in C, and a is also a superset of b2 (with b1, b2 disjoint). In this case, we want to wind up with the following structure: Root({a,b1}(a(b2), b1)); that is, b2 is moved down to be a sub-category of a even though a is added as a sub-category of the newly introduced {a,b1} category.

[0327] Suppose a overlaps b1, and a is a subset of b2. Then b1 would at least have to overlap b2; since these are in C we can choose a protocol which rules such situations out (e.g., insist on single-link semantics for C). The same situation could happen symmetrically, i.e., a1 overlaps b and a2 is a superset of b.

[0328] Anomalies

[0329] Last but not least, we can decide the concept is truly anomalous at this point in the structure, and determine whether to allocate it to C, V or the remainder model. If it is allocated to C, then a new concept is created in C corresponding to the concept in the input model, all the instances are copied over, and a link is created from the concept in I_(i) to the new concept in C. The same procedure is followed for the other models.

[0330] One important difference must be supported when creating an anomalous concept in V. A feature facet must be created that links that concept to the current parent concept in C. This needs to be created whether or not the feature facet is enforced as an intensional constraint on instances created with the new C and V models. The other essential role of this facet link has to do with continuing the CFU algorithm's walk down the various input models.

[0331] Example: FIG. 25

[0332] FIG. 25 shows the desired output configurations for various types of input conditions where we can infer a subset or overlapping relationship between the incoming concept and a concept in C:

[0333] a) If the concept “Hats and Gloves” is in C and “Hats” in I_(i): add or match the new concept “Hats” as a child of “Hats and Gloves”, then return to the matching tour at the current sibling level (2501).

[0334] b) If we find “Hats and Gloves” in the input I_(i) and “Hats” is already in C, introduce a new concept “Hats and Gloves” in C as a parent to “Hats”; demote “Hats” down from its current set of siblings to be a child of the new (or unified) concept “Hats and Gloves”. The instances associated with “Hats” will remain as they are; the instances associated with “Hats and Gloves” in I will be carried over to the new concept but will not inherit to the child concept “Hats” (2503).

[0335] We assume there is no match for “Hats and Gloves” at the sibling level, or we would have already matched it. If inadvertently we introduce some overlapping terms in the siblings of C, we may have this detected either by an actual name clash (the same name used twice for concepts in C), or with overlapping extents.

[0336] c) When a superset relation is found as in case (b) above, we must try to rematch the same concept to the other siblings within C. Here, we move on to compare “Hats and Gloves” with “Gloves”; finding the subset relation, we match “Hats and Gloves” with the newly created “Hats and Gloves”, and therefore attach “Gloves” to this concept via demotion (2505).

[0337] d) The final case to consider is where name syntax implies overlapping (rather than sub- or superset) relations between the concepts in I_(i) and C. In this case the actions to perform are a composite of those required for the other cases. We introduce a new concept which is the union of the two overlapping sets (suggested by the name “Hats, Gloves, Ties”). This is the first case where we are forced to create a new concept which might be artificial in nature. We make the other two concepts children of this synthetic concept (2507).

[0338] Note that for each of the patterns above there will typically be some elicitation from the user required in order for the transformation to be posted to the composite model C.
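The way name syntax can suggest which of cases a) through d) applies is illustrated by the following Python sketch. It is a hypothetical heuristic, not the procedure's actual matching rule: it assumes that a concept name can be split into terms on commas and on the word “and”, and it classifies the incoming name against an existing name in C by comparing the resulting term sets. In practice the inferred relation would still be confirmed with the user.

```python
import re


def terms(name: str) -> frozenset:
    """Split a concept name into lower-cased terms on commas and the word 'and'."""
    parts = re.split(r",|\band\b", name)
    return frozenset(p.strip().lower() for p in parts if p.strip())


def relation(incoming: str, existing: str) -> str:
    """Classify the incoming concept name relative to an existing name in C."""
    a, b = terms(incoming), terms(existing)
    if a == b:
        return "match"
    if a < b:
        return "subset"      # case a): add the incoming concept below the existing one
    if a > b:
        return "superset"    # cases b) and c): introduce a parent, demote the existing one
    if a & b:
        return "overlap"     # case d): introduce a synthetic union concept
    return "anomalous"
```

For instance, relation("Hats", "Hats and Gloves") gives "subset", relation("Hats and Gloves", "Hats") gives "superset", and a pair of names such as "Hats and Gloves" and "Gloves and Ties" (used here only as an illustrative input) gives "overlap". The word-boundary test keeps a name like "Handbags" from being split at its embedded "and".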

[0339] Matching in Inverse Order with Inferred Superset Semantics: FIG. 26

[0340] The scenario above assumes that the composite model C is the one with more level structure. Since trees can arrive in arbitrary order, we get the other case simply by assuming a reversal of the input order for the trees. This is illustrated in FIG. 26. There are two relevant scenarios: one where there are syntactic or user-driven cues to identify the superset relation when it is first encountered; the other where this is not determined till later.

[0341] For the first case:

[0342] 1) Assume “Women's Shoes” as depicted at 2603 is the current state of the composite model C, and “Men's Shoes” 2601 is the new input.

[0343] 2) Sandals matches Sandals.

[0344] 3) “Boots” is a candidate anomaly.

[0345] 4) We now look for subset/overlap relations. “Boots” and “Ski Boots” gives a partial match (inverse of case above). That is, the match suggests “Boots” is a superset of “Ski Boots.” This tells us that we are going to want to introduce “Boots” and demote “Ski Boots” below it. (We do not attempt to match “Boots” again to some sibling of “Ski Boots” in C.)

[0346] 5) Since we have found a superset relation, we check the other siblings of “Ski Boots” for other possible superset relations. Continue to try to match “Boots”, now to “Hiking Boots.” We find syntactic evidence of a superset relation with “Hiking Boots.” We now demote “Hiking Boots” to be a child of “Boots.”

[0347] 6) We are done with this level. Eventually we will move downward in the input subtrees until we come to process “Hiking Boots” and “Ski Boots” as the sibling set under “Men's Shoes/Boots”. The match goes easily for the two children of “Boots” in C, since both were created in the earlier pass.

[0348] 7) Elicitation. At various points in the sequence above we may choose to confirm or validate with the user. We can certainly validate the correctness of the guessed matches and subset relations (or even elicit these from the user). In addition, we may want to offer the user the choice of overriding the model “style” that will be chosen by default.

[0349] When we ask for elicitation from the user, we can offer the user one of two choices: a) Add “Boots” as a new concept under “Women's Shoes” and demote “Ski Boots” from being a direct child of “Women's Shoes” to a child of “Boots.” b) Prefer the “flattened” version where “Boots” goes away and “Ski Boots” remains as is. In order for this to be a meaningful choice for the user to make, though, he or she will need to see the other children of “Men's Shoes/Boots” to recognize that this is a potential flattening operation as opposed to a splitting operation. Here the need for interface 1601 to provide a “fisheye” view of the relevant context is clear.
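The introduce-and-demote operation of steps 4) and 5) can be sketched in Python as follows. The sketch is illustrative only: the composite model is represented as nested dictionaries mapping a concept name to its children, and the syntactic cue for a superset relation is assumed to be that a sibling's name ends with the incoming name as a separate word. Both the representation and the cue are assumptions, and user elicitation (choice a versus choice b) is not modeled.

```python
def introduce_superset(siblings: dict, new_name: str) -> dict:
    """Return a new sibling level in which `new_name` is introduced and every
    sibling whose name ends with `new_name` is demoted below it."""
    suffix = " " + new_name.lower()
    # Siblings with syntactic evidence of being sub-categories of new_name.
    demoted = {name: sub for name, sub in siblings.items()
               if name.lower().endswith(suffix)}
    # Everything else stays at the current level.
    result = {name: sub for name, sub in siblings.items() if name not in demoted}
    # Introduce (or extend) the superset concept and attach the demoted children.
    result[new_name] = {**result.get(new_name, {}), **demoted}
    return result
```

Applied to a “Women's Shoes” level containing Sandals, Ski Boots and Hiking Boots with the incoming concept Boots, the sketch leaves Sandals in place and yields a new Boots concept with Ski Boots and Hiking Boots as its children, which corresponds to choice a) above.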

[0350] Alternative Scenario. In this scenario we presume there are no syntactic clues to guide the leveling search:

[0351] 1) Assume “Women's Shoes” as depicted at 2603 is the current state of the composite, and “Men's Shoes” 2601 is the new input.

[0352] 2) Once again, Sandals matches.

[0353] 3) “Boots” is anomalous. We ignore the partial match of Boots to Ski Boots.

[0354] 4) We add “Boots” to C as a new (direct) child concept of “Women's Shoes.” Done with this round.

[0355] 5) We now (eventually, in the walk of the various input subtrees) recurse down to children “Hiking Boots” and “Ski Boots” as under “Men's Shoes/Boots” in the input model.

[0356] 6) There is currently nothing under “Boots” in the composite model C. So “Hiking Boots” will show up as anomalous and will be added in. (If there is a direct name duplication, a warning may be flagged; otherwise, the “false duplicate” will fall into the model C.)

[0357] 7) Elicit intent from the user. We approach the same leveling choice from the other direction.

[0358] 8) Repeat for Ski Boots.
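
Step 6 of this scenario, adding the anomalous concept and flagging a direct name duplication, might look like the sketch below. The nested-dict tree and the helpers `find_paths` and `add_anomaly` are assumptions made for illustration; the source does not specify how the warning is produced.

```python
def find_paths(tree, name, prefix=()):
    """Yield the path of every node called `name` in a nested-dict tree."""
    for key, subtree in tree.items():
        path = prefix + (key,)
        if key == name:
            yield path
        yield from find_paths(subtree, name, path)


def add_anomaly(tree, parent_path, name):
    """Add `name` under `parent_path`; return warnings for existing same-named nodes."""
    duplicates = list(find_paths(tree, name))
    node = tree
    for key in parent_path:
        node = node[key]
    node.setdefault(name, {})
    return ["duplicate name %r already at %s" % (name, "/".join(p)) for p in duplicates]


# State of C after step 4: "Boots" was added as a direct child with no children.
composite = {"Women's Shoes": {"Sandals": {}, "Ski Boots": {}, "Hiking Boots": {}, "Boots": {}}}
flagged = add_anomaly(composite, ["Women's Shoes", "Boots"], "Hiking Boots")
print(flagged)
```

The returned warnings are the natural trigger for the elicitation in step 7, since each one marks a place where the user must choose between the nested and the flattened position.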

[0359] Elicitation of Intent.

[0360] At the elicitation points called out above (Steps 6 and 7 respectively) we come to a critical junction in the transform. The procedure does not have a built-in default preference in these cases; we have it elicit a decision from the user.

[0361] The desired configuration hinges on these semantic questions:

[0362] Are all Ski Boots Boots? If the answer is yes, go with the “child” position for Ski Boots; if the answer is no, then the flattened version is preferable. By accepting the flattened version, we assert the possibility that some Ski Boots are not Boots.

[0363] Are all Boots either Ski or Hiking Boots? If the answer is yes, the flattened version can be used; if the answer is no, the flattened version results in information loss. By accepting the flattened version, we also lose the direct concept Boots; if there are Boots that are not ski or hiking boots we will lose information about these items, because we will not be able to allocate them more specifically.
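
The way the two answers bear on the leveling choice can be tabulated in a few lines. The function name `leveling_choice` and its return labels are hypothetical, and the handling of the two combined answers is our reading of the questions above, not a rule stated in the source.

```python
def leveling_choice(all_ski_boots_are_boots, all_boots_are_ski_or_hiking):
    """Map the user's two yes/no answers to a suggested configuration."""
    if all_ski_boots_are_boots and not all_boots_are_ski_or_hiking:
        return "nest"      # keep Boots as parent; flattening would lose information
    if not all_ski_boots_are_boots and all_boots_are_ski_or_hiking:
        return "flatten"   # drop Boots; Ski Boots stays a direct child
    if all_ski_boots_are_boots and all_boots_are_ski_or_hiking:
        return "either"    # nesting is valid and flattening loses nothing
    return "conflict"      # flattening is preferred but loses information


print(leveling_choice(True, False))
```

The "conflict" outcome is the case in which the answers pull in opposite directions, so the decision has to stay with the user.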

[0364] Multi-Level Factoring: FIGS. 27, 28, and 29

[0365] The example shown in FIG. 27 illustrates an additional complicating element: multiple factorings, as this interacts with factoring flattened vs. relatively unflattened structures. Here we have the plausible situation that a distinction important in the realm of men's clothes (2701) (formal vs. casual) is not deemed important for boys (2703). In addition, the same subtree repeats under categories “Formal” and “Casual” under Men's Clothes. (We assume the factoring interaction is invoked from two roots, Men's Clothes and Boys' Clothes.)

[0366] To clarify the precise problem faced in the transform, FIG. 28 illustrates the exact state of the transform walk at the point where the multi-level occurrence of concept Shirts is discovered. We have finished the first level siblings and have begun the children under Formal, which correlate to the currently empty subtree under the (suspect) concept Formal in C 2805.

[0367] At the point in the procedure where the problem is discovered, we do not have visibility yet onto the overall configuration. We do not know that all the other children of Men's Clothes/Formal (in 2807) will have matches in C 2805. We know only that an anomalous concept under a suspect concept has a match at the “uncle” level. This is our first hint that a multi-level factoring problem may emerge.

[0368] The configuration raises, as usual, some subtle ontological questions. In particular: are all Shirts Formal Shirts? If so, we can eliminate Shirts at the higher level in the taxonomy and leave it below Formal. We can see, informally, from the original models that this is not the case. Also, the content suggests in this case that the condition does not hold, unlike Boots and Hiking Boots in the earlier example, which suggested a clearer superset/subset relation. But there is little structural information in the model configuration as shown to allow us to conclude this. (Compare with the Boots example earlier and it will be apparent that the pattern of concept matches, levels, etc. is very similar in the two examples, but the intended semantics is quite different.)

[0369] The following procedure handles this sort of anomaly.

[0370] We begin with an outcome like that pictured in FIG. 28. Note that in this case, some duplication remains in the models.

[0371] But there is a problem. We need to allocate current Boys' Clothes instances to the new models. With the configuration as shown, how do we link instances of, for example, Boys' Shoes? These can't be linked to Shoes under either Formal or Casual, as these distinctions do not apply to the current extent of Boys' Clothes.

[0372] The following steps outline the performance of the algorithm (refer to FIGS. 27 and 28):

[0373] 1) Assume Boys' Clothes 2703 is the first set of siblings seen. These are carried over to C 2805 as is.

[0374] 2) Concept “Formal” in 2801 is anomalous and is carried over to C 2805 as a “suspect” concept.

[0375] 3) Similarly with concept “Casual”.

[0376] 4) Jackets matches.

[0377] 5) Done with this level. C 2805 now has: Shirts, Pants, Shoes, Jackets, Formal, Casual, Jackets. Recurse to next set of siblings.

[0378] 6) Under Formal in 2701 we find “Shirts.” This is anomalous, since C 2805 has no sub-structure as of yet below Formal.

[0379] 7) Because it takes place under a suspect concept, we search “uncle” level concepts before adding it. We find a match.

[0380] 8) Once again, we face a modeling problem; which is preferable? Elicit the answer from the user, prompting with the semantic import of the different possible decisions:

[0381] If we leave Formal in C 2805 we are saying that Formal and Casual were extensional attributes of Men's Clothes. There could be Boys' Formal and Boys' Casual, and the model simply had not articulated these.

[0382] If we make Formal part of V 2803, we are saying the category is relevant only to Men's Clothes; i.e., it has intensional import in the model.
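
The “uncle” level search of steps 6 and 7 can be sketched as below. The `Concept` class, the `suspect` flag, and matching by exact name are assumptions made for illustration; the disclosed correlation may use other criteria.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    suspect: bool = False
    children: list = field(default_factory=list)


def find_uncle_match(siblings, parent, child_name):
    """Search the siblings of a suspect parent for a concept named `child_name`.

    Returns the matching "uncle" concept, or None if the parent is not
    suspect or no sibling matches.
    """
    if not parent.suspect:
        return None
    for uncle in siblings:
        if uncle is not parent and uncle.name == child_name:
            return uncle
    return None


# First-level siblings in C after step 5, with Formal and Casual marked suspect.
c_level = [Concept("Shirts"), Concept("Pants"), Concept("Shoes"), Concept("Jackets"),
           Concept("Formal", suspect=True), Concept("Casual", suspect=True)]
formal = c_level[4]
match = find_uncle_match(c_level, formal, "Shirts")
print(match.name if match else None)
```

A non-None result is the hint described above that a multi-level factoring problem may be emerging, and it is the point at which the user is asked whether Formal belongs in C or in V.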

[0383] Intensional Scenario. Assume we decide Formal is intensionally correlated to Men's Clothes. (See the discussion below for the alternative scenario.)

[0384] 1) We move Formal to V, unify Shirts, and link instances to both Shirts in C and Men's/Formal in V.

[0385]  We now must reset the “current sibling list” in C from the (null) children list under Formal (a concept now placed in the other factor model V), back to the “uncle” level sibling list.

[0386]  At the same time, we must remember our context within the model V, which is now positioned at Men's/Formal.

[0387] 2) Repeat the matching procedure for Pants. Only now, Pants is directly matched to its analogue in C. We link instances to that concept and to Men's/Formal in V. Similarly for Shoes.

[0388] 3) We have finished the children of concept Formal in the original input model I_(i). We now move on to the children of concept Casual.

[0389]  We have to reset context in V back to Men.

[0390]  Similarly, we find the analogous (still “suspect”) concept Casual in C (as of yet with no children).

[0391] 4) We match Shirts, which is anomalous under a suspect concept, so check siblings of Casual in C.

[0392] 5) Process as above (Steps 7-11) for children under Casual. The result is as shown in FIG. 29 (a) (2901).
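
The double linking performed in the intensional scenario, where each instance is attached both to its unified concept in C and to the current context in V, can be sketched as follows. The function `factor_intensional`, the dictionary layout, and the instance names are hypothetical and stand in for whatever instance representation an implementation uses.

```python
def factor_intensional(context, subtree):
    """Link instances to a concept in C and to a context path in V.

    `subtree` maps each suspect concept (moved to V) to a dict of
    child concept -> list of instances.
    """
    c_links, v_links = {}, {}
    for suspect, children in subtree.items():
        v_path = context + "/" + suspect          # current position within V
        for concept, instances in children.items():
            c_links.setdefault(concept, []).extend(instances)   # unified concept in C
            v_links.setdefault(v_path, []).extend(instances)
    return c_links, v_links


# Illustrative instances only; they do not come from the figures.
mens_clothes = {
    "Formal": {"Shirts": ["dress shirt"], "Pants": ["suit trousers"]},
    "Casual": {"Shirts": ["polo shirt"]},
}
c_links, v_links = factor_intensional("Men's", mens_clothes)
print(c_links)
print(v_links)
```

Here the single concept Shirts in C collects instances from both Formal and Casual, while V records which of the two each instance came from.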

[0393] Extensional Scenario. Starting from Step 8 above: Assume that, instead of Steps 9-13 above, we decide the concept Formal was only extensionally correlated with Men's Clothes. This means we are willing to allow for instances of Boys' Formal (and presumably, but not necessarily, Casual) Clothing to be added to the model later.

[0394] In this option we choose the approach that takes account of the actual instances in the current configuration. In saying the concept Formal is extensional, we are saying that instances of Boys' Formal Shirts should be allowed in the model. We know, by informal inspection of the original models, that there are currently no known instances classified this way (as the original models could not have captured this information). So, only by re-classifying or further classifying current instances, or adding new ones, would we need to accommodate Boys' Formal Shirts. This means we could simply follow the same approach as in the intensional case and move the concept Formal to V as in Step 9 above, and illustrated in FIG. 29 at 2901. If and when we need the new category we can introduce it by duplicating the concept Formal under Boys' in V, as indicated in FIG. 29 at 2905. This approach suggests that in this situation we always build the same factored models regardless of the intensional or extensional status. Only in later model evolution might we duplicate Formal/Casual in model V. If this occurs, we have the option of iteratively invoking factoring on that model. Model V will effectively be submitted as the input source model, and the Gender model would result as V2, the “Formality” model as C2.

[0395] Modified Intensional Scenario.

[0396] Finally, there is another solution illustrated at 2907 in FIG. 29. Recall that the semantic question “begged” by the multi-level match was: “Are all Shirts Formal?” Since “Boys' Shirts” don't use the “Formal” distinction, the answer is no. Yet there may be a significant subset of shirts that are intensionally Formal. In this case, we can create a concept which stands for the subset of “Formal Shirts”. That is, instead of making “Shirts” a sub-concept of “Formal” as in the problematic Figure (b), we make “Formal (Shirts)” a sub-concept of Shirts in model C. We may need to repeat this strategy for some siblings of “Shirts” such as “Pants” and “Shoes.” So we are potentially introducing some duplication back into C. However, this last solution (d) seems to address many of the concerns that have emerged in the discussion so far:

[0397] We do not reduplicate Shirts, since this causes an assignment problem as we have seen.

[0398] We retain the ability to describe Boys' Shirts without use of the Formal characteristic. (In fact, these instances stay allocated to the concept they were allocated to previously.)

[0399] Men's Formal shirts can be allocated to the new concept without loss of semantic expressiveness.

[0400] The model can accommodate later evolution of the model introducing Boys' Formal (or Casual) Shirts. (Similarly, Men's Shirts without the qualification of Formal vs. Casual could be supported.)

[0401] Although the end result has duplication (literally, a “flipping of the axes” for the original model) this also creates clear conditions for iterative execution of the transform, with the subtrees rooted at concepts Shirts, Pants, and Shoes within the Clothes model (Model C in 2907 of FIG. 29) as the three inputs. This iterative execution of the transform will result in a constellation 3001 of the same three models we might have intuitively derived when viewing the three factors, as illustrated in FIG. 30. (Note that we would arrive at the same three models 3001 if we had chosen the alternative at 2905. However, in this case C rather than V would be submitted as the new input model to the second iteration of the transform, and the Clothing model would be V2, the Formality model C2.)
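
The “flipping of the axes” in the modified intensional scenario can be illustrated with a small transformation. The function `flip_axes` and the list-based layout are assumptions for this sketch, and applying the same treatment to Casual as to Formal is our extrapolation from the text.

```python
def flip_axes(tree):
    """Turn {outer: [inner, ...]} into {inner: ["outer (inner)", ...]}.

    Each inner concept becomes a parent, and each outer distinction
    reappears beneath it as a qualified sub-concept.
    """
    flipped = {}
    for outer, inners in tree.items():
        for inner in inners:
            flipped.setdefault(inner, []).append("%s (%s)" % (outer, inner))
    return flipped


mens_clothes = {
    "Formal": ["Shirts", "Pants", "Shoes"],
    "Casual": ["Shirts", "Pants", "Shoes"],
}
flipped = flip_axes(mens_clothes)
print(flipped["Shirts"])
```

The repeated Formal and Casual sub-concepts under Shirts, Pants, and Shoes are the duplication mentioned above, and the three resulting subtrees are the inputs for the iterative execution of the transform.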

[0402] Further Sequencing.

[0403] Returning to the illustration in FIG. 28, suppose we have finished processing Men's Clothes/Formal/Shirts and now move on to Pants. In order for the algorithm to work out properly it is important that the suspect concept Clothes/Formal still be in C.

[0404] The rule of thumb is that, if a suspect concept is added to C, it should remain in the model until the subtrees underneath the concepts from (all!) the original input models I_(i) have been allocated. Once all these instances have been assigned, we can do a “cleanup” pass over C. If none of the matches have resulted in “utilization” of the concept (no instances assigned to it, no children created for it, no feature links or constraints) then it can be removed.
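
A cleanup pass of this kind might be sketched as follows. The `Concept` fields and the function `cleanup` are hypothetical; the sketch assumes it is run only after every input subtree has been allocated, as the rule of thumb requires.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    suspect: bool = False
    instances: list = field(default_factory=list)
    children: list = field(default_factory=list)
    links: list = field(default_factory=list)   # feature links or constraints


def cleanup(concepts):
    """Drop suspect concepts that ended up with no instances, children, or links."""
    kept = []
    for concept in concepts:
        concept.children = cleanup(concept.children)   # clean subtrees first
        utilized = bool(concept.instances or concept.children or concept.links)
        if concept.suspect and not utilized:
            continue
        kept.append(concept)
    return kept


# "Hats" is an invented child used only to show a utilized suspect concept.
c_model = [
    Concept("Shirts", instances=["boys' shirt"]),
    Concept("Formal", suspect=True),
    Concept("Casual", suspect=True, children=[Concept("Hats")]),
]
remaining = [c.name for c in cleanup(c_model)]
print(remaining)
```

In this example the unused suspect concept Formal is removed, while Casual survives because a child was created for it.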

[0405] Allocation: FIG. 31

[0406] When a single taxonomy has redundant substructure, the subtrees to be factored will in some cases be literal copies that can be matched via simple textual comparison. Where this is the case an almost mechanical and largely automated procedure is sufficient to facilitate factoring. However, in most factoring transformations some but not all the concepts in one input subtree will have clear analogies in the others.

[0407] Formal Problems in Allocating Anomalous Concepts: FIG. 31

[0408] We term a concept “anomalous” (or, in this context, simply an “anomaly”) if, at the time that correlations are made between the input subtree and C, it is deemed to have no match or subset relation in C and therefore becomes a candidate to be added as a new concept, either into C or V. Once the concept is added into C, we continue to use the term “anomaly” informally to mean a concept that has linkage to only one input concept. The anomaly may be the result of originally replicated models that were subject to different diverging modifications in different contexts, or models that were independently developed to describe analogous subject areas. It may also be a transient artifact of the sequencing of input trees in the CFU “walk”, so that concepts which are treated as anomalies when they first enter the model C will typically become matches later on.

[0409] As an example of the problem of anomalous concepts, suppose we have the original models 3102 and 3103 shown in FIG. 31. The factored models will be as shown at 3105. But where should we place the concept Bras from source model 3103? Since Bras shows up under Women's Clothing and not under Men's Clothing, what is the status of the missing category? How, if at all, do we preserve the implicit information conveyed (or implied) by the original models (i.e., that there are no men's bras)? Is this implicit information actually what the modelers intended? Is it correct, or have we discovered an opportunity for innovation?

[0410] CFU provides a systematic “walkthrough” of the models, prompting users for qualitative elicitation and analysis at key points. Though the overall procedure is far more streamlined and efficient than manual modification would be, there is still a key “human in the loop” component.

[0411] Handling Anomalies

[0412] In the following paragraphs we outline a procedure for handling anomalous concepts. For simplicity we consider a scenario with two sets of sibling concepts to be matched and one concept that is clearly an anomaly according to the matching protocol employed. We need to decide where the anomalous concept will reside.

[0413] Recommended Procedure. The algorithm for handling anomalies proceeds in the following way:

[0414] 1) Overall criteria and defaults are established for the factoring pass.

[0415] 2) The trees are “walked” by the main CFU algorithm, resulting in comparisons of a given group of concept “sibling sets” in multiple input subtrees.

[0416] 3) Matching criteria are applied to find analogous concepts within the various subtrees. These matching criteria may try to take many factors into account (such as Ariadne decorations, possible splits, merges and level shifts, or the evidence of overlapping extents of the concepts).

[0417] 4) Once anomalous concepts have been identified, the user is presented with a set of choices for how to allocate the concepts. The semantic implications of the choices can also be made clear through the interaction (with varying degrees of explanatory support provided).

[0418] 5) Depending on the approach to the walk of the input models, look-ahead and user choices, it may be necessary in some cases to backtrack, undo previous decisions, or otherwise modify the output models as part of the procedure.

[0419] This strategy depends on the following criteria:

[0420] Is the commonality model C to be produced intended to represent the intersection of the synthesized subtrees (that is, only concepts that occur in all the subtrees) or their union (concepts that occur in any of the subtrees)?

[0421] Is intensionality expressed or implied by occurrence or non-occurrence of particular concepts in different subtrees? In our example, there is no concept Bras under Men's Clothes. Do we assume that the models are exhaustive in describing what is in the world; i.e., that there are no men's bras? Or do we take the position that absence of the concept does not necessarily imply emptiness of the category?

[0422] Similar issues arise regarding the extensionality expressed or implied by the instances associated with each concept.

[0423] Case by Case Allocation of Anomalous Concepts

[0424] This is a hybrid strategy that employs different strategies on a concept-by-concept basis. The major issue here is how to decide when to place the concept into C, the “composite” factor (in our example, Bras would be added as a concept under Clothing), and when to place the concept into V, the “variability” factor (in our example, Bras would be placed under Women). If the user determines that the intent of the factoring is that some concept really should be in the intersection of all models (e.g., there should be a possibility for men's skirts, even if none were present in the original men's model), then the concept goes into the commonality factor C. If, on the other hand, the intent of the factoring is that the concept really is unique to some input models (e.g., maternity clothing is essentially for women only, and never for men), the concept goes into the variability model V.

[0425] Recommended Strategy. Given the alternatives above, the recommendation for handling anomalous concepts in this implementation of the CFU transform is as follows:

[0426] Establish the intended scope for model C at the start of the interaction. In particular, choose whether we want the model to reflect union vs. intersection semantics. By default, intersection semantics is used. This helps ensure the overall semantic coherence of the various models.

[0427] Elicit intended semantics for anomalous concepts on a case-by-case basis, to determine which constraints to link at each concept.

[0428] CFU Procedure in the Case of Anomalous Concepts

[0429] The following steps describe a desired interaction to assist the user in allocating anomalous concepts.

[0430] Defaults. We first need defaults for where anomalous concepts should be placed.

[0431] If we have chosen the “intersect” protocol for the C model, by default we will tend to exclude concepts that suggest “gaps” (like swimsuits when there are no swimsuits to offer). After the initial case, under the intersect protocol the default position for an anomalous concept would be in the appropriate subtree of V.

[0432] If we have chosen the “union” protocol for the composite model, the default position for an anomalous concept would be in the composite model C.

[0433] We could also specify that anomalies are directed by default to T_(i)′.
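
These defaults amount to a small lookup, sketched below. The function `default_placement`, the protocol strings, and the `redirect` parameter (standing for an explicitly specified target such as T_(i)′) are illustrative assumptions, not names from the prototype.

```python
def default_placement(protocol, redirect=None):
    """Return the default target model for an anomalous concept.

    `redirect`, if given, overrides the protocol default (for example,
    a user-specified target such as "T_i'").
    """
    if redirect is not None:
        return redirect
    defaults = {"intersect": "V", "union": "C"}
    if protocol not in defaults:
        raise ValueError("unknown protocol: %r" % (protocol,))
    return defaults[protocol]


print(default_placement("intersect"))
print(default_placement("union"))
```

The returned value is only a starting point; the elicitation that follows can still move an individual concept to the other model.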

[0434] Elicitation. For each anomalous concept we can ask the following questions (using the “bra” example):

[0435] “Are there already male bras in the current extent?” (This can be tested autonomously by an Ariadne agent.) In this case, the anomaly in the concepts would likely be a result of poor classification and the case would resolve to the more typical matching case.

[0436] “Is there a men's undergarment (i.e., closest relevant matched concept in the two subtrees) which is the equivalent of a bra?” Here we are testing the intensional gap. If we find a match, we have reverted back to a potential correlation and fallen out of anomaly processing; however, we may need to handle naming preservation differently.

[0437] “Are all bras women's undergarments?”

[0438] No (for example, men's support bras): Put the concept in the Clothing (C) model. The “gap” is extensional in nature rather than essential; we do not need to consider adding a constraint to enforce the correlation of Bras with Women's Clothes. (On the contrary, we might want to elicit a placeholder for a “counter-instance” to be placed as an instance. The counter-instance could be tracked (via an infrastructure concept) as a particular category of instance that does not correspond to a specific item. Creating this instance would exclude “men's bra” from being caught in a gap analysis run on the models. Strictly speaking this goes beyond the scope of processing required for factoring and into supporting pre-work for gap analysis.)

[0439] Yes: We have determined a necessary feature of the concept Bra. Put the concept Bras in the Gender (V) model with no constraint.
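
The sequence of elicitation questions can be read as a short decision cascade, sketched below. The function `allocate_anomaly`, its parameter names, and its return labels are hypothetical; in the procedure above the answers come from the user (or, for the first question, possibly from an agent).

```python
def allocate_anomaly(found_in_other_extent, has_equivalent_concept, always_in_source_context):
    """Decide what to do with an anomalous concept from three yes/no answers."""
    if found_in_other_extent:
        return "treat as match"   # anomaly was likely a classification error
    if has_equivalent_concept:
        return "correlate"        # fall out of anomaly processing
    if always_in_source_context:
        return "V"                # necessary feature: place in the variability model
    return "C"                    # gap is extensional: place in the composite model


# "Are all bras women's undergarments?" answered No, with no earlier match.
print(allocate_anomaly(False, False, False))
```

Ordering the questions this way means the C-versus-V decision is reached only after the two cheaper explanations for the anomaly have been ruled out.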

CONCLUSION

[0440] The inventors have disclosed to those skilled in the pertinent arts the best mode presently known to them of making and using systems which perform the CFU transform on input graphs. While the transform can be used with particular advantage with taxonomies that represent catalogs of items, it can be used with any graphs having facet types such that a hierarchical walk through the graph is possible. The inventors have disclosed a prototype implementation of their invention and have also disclosed how other embodiments may use correlation techniques different from those employed in the prototype and may deal with correlations at different levels of the graphs. While the prototype is implemented in the Ariadne system, the techniques of the invention can be used generally with graphs and do not require the use of the Ariadne system. The inventors have further disclosed two different user interfaces for indicating whether the nodes belonging to a tuple of nodes are in fact analogous; other embodiments may employ other user interfaces. For all of the foregoing reasons, the Detailed Description is to be regarded as being in all respects exemplary and not restrictive, and the breadth of the invention disclosed herein is to be determined not from the Detailed Description, but rather from the claims as interpreted with the full breadth permitted by the patent laws.

What is claimed is:
 1. A method performed in a system having a processor, a memory accessible thereto, and a user interface of unifying child nodes of a plurality of parent nodes from one or more other graphs into composite graph child nodes of a composite graph parent node in a composite graph, the graphs being stored in the memory and the method comprising the steps performed by the processor of: correlating the child nodes, including any of the composite graph child nodes, to produce one or more sets of possibly analogous nodes; displaying a representation of the set of possibly analogous nodes in the user interface and receiving an indication via the user interface whether nodes in the represented set are taken to be analogous; and making siblings of the composite graph child nodes of the composite graph parent node as required to provide a composite graph child node corresponding to each of the indicated sets of analogous nodes.
 2. A method performed in a system having a processor and a memory accessible thereto of correlating a node in a first graph with a possibly analogous node in a second graph, both graphs being stored in the memory and the method comprising the steps performed by the processor of: analyzing the first node's relationship to another node in the first graph to obtain a first result; analyzing the second node's relationship to another node in the second graph to obtain a second result; and using the results to determine at least in part whether the first node is correlated with the second node.